Adds two extra generations per frame. Expect many minutes on CPU.
Model: alpharomercoma/vqwen-qformer-tiktok-v2
· EVA-CLIP-G/14 + Q-Former + frozen Linear projector + Qwen3-4B (LoRA merged) ·
Audio transcription: openai/whisper-base ·
Running on CPU in bfloat16.
End-to-end on free CPU: ~3-5 min per video (audio + 1 frame × ~30-90s of generation).
For snappier inference, duplicate this Space and select ZeroGPU hardware.