Adds two extra generations per frame. On CPU, at roughly 30-90 s per generation, expect about 1-3 additional minutes per frame.
Model: alpharomercoma/vqwen-qformer-tiktok-v2 · EVA-CLIP-G/14 + Q-Former + frozen Linear projector + Qwen3-4B (LoRA merged) · Audio transcription: openai/whisper-base · Running on CPU in bfloat16.

End-to-end on free CPU: ~3-5 min per video (audio + 1 frame × ~30-90 s of generation). For snappier inference, duplicate this Space and select ZeroGPU hardware.
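As a rough illustration of the timing figures above, the sketch below estimates generation time only. It assumes one base generation per frame at ~30-90 s, and that each of the two extra generations costs about the same as the base one. The function name `generation_time_range` and its parameters are hypothetical and not part of this Space; audio transcription and model loading are not included.

```python
def generation_time_range(frames, extra_generations=False, per_generation_s=(30, 90)):
    """Return a (low, high) estimate in seconds for generation alone.

    Assumption: every generation, base or extra, takes roughly the same
    ~30-90 s on CPU. Audio transcription and model loading are excluded.
    """
    if frames < 1:
        raise ValueError("frames must be at least 1")
    # One base generation per frame, plus two more when the extra option is on.
    generations_per_frame = 3 if extra_generations else 1
    total_generations = frames * generations_per_frame
    low, high = per_generation_s
    return total_generations * low, total_generations * high


if __name__ == "__main__":
    for extra in (False, True):
        low, high = generation_time_range(1, extra_generations=extra)
        print(f"1 frame, extra={extra}: {low / 60:.1f}-{high / 60:.1f} min")
```

Under these assumptions, enabling the extra generations triples the per-frame generation cost, which is why the CPU wait grows quickly as frames are added.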