Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
Jinhui Yi, Syed Talal Wasim, Yanan Luo, Muzammal Naseer, Juergen Gall

TL;DR
Video-Panda introduces a lightweight, encoder-free model for video-language understanding that significantly reduces computational costs while maintaining or improving performance on key benchmarks.
Contribution
The paper proposes a novel Spatio-Temporal Alignment Block (STAB) that processes videos without pre-trained encoders, using only 45M parameters for visual processing (at least a 6.5× reduction compared to encoder-based approaches) and increasing processing speed.
Findings
Achieves comparable or better performance than encoder-based models on video question answering.
Outperforms Video-ChatGPT and Video-LLaVA in accuracy and temporal understanding.
Runs 3-4 times faster than previous methods.
Abstract
We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders while using only 45M parameters for visual processing, at least a 6.5× reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention, and separate mechanisms for modeling frame-level and video-level relationships. Our…
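The "at least 6.5×" figure can be checked against the parameter counts quoted in the abstract. The short sketch below is only an arithmetic illustration: the variable and function names are hypothetical, and the numbers are the rounded ranges stated above, not exact model sizes.

```python
# Parameter counts quoted in the abstract, in millions (rounded figures).
STAB_PARAMS_M = 45

ENCODER_RANGES_M = {
    "image encoder": (300, 1100),   # 300M-1.1B
    "video encoder": (1000, 1400),  # 1B-1.4B
}


def reduction_factors(encoder_ranges, baseline):
    """Return {name: (min_ratio, max_ratio)} of encoder size over baseline size."""
    return {
        name: (low / baseline, high / baseline)
        for name, (low, high) in encoder_ranges.items()
    }


factors = reduction_factors(ENCODER_RANGES_M, STAB_PARAMS_M)
for name, (low, high) in factors.items():
    print(f"{name}: {low:.1f}x to {high:.1f}x the size of the 45M visual module")
```

The smallest quoted encoder (300M) gives the lower bound, 300 / 45 ≈ 6.7, which is consistent with the stated reduction of at least 6.5×; the larger encoders give correspondingly larger ratios.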
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
Methods: Softmax · Attention Is All You Need
