Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models

Jinhui Yi; Syed Talal Wasim; Yanan Luo; Muzammal Naseer; Juergen Gall

arXiv:2412.18609·cs.CV·March 28, 2025

TL;DR

Video-Panda introduces a lightweight, encoder-free model for video-language understanding that significantly reduces computational costs while maintaining or improving performance on key benchmarks.

Contribution

The paper proposes a novel Spatio-Temporal Alignment Block (STAB) that processes videos directly, without pre-trained encoders. It uses only 45M parameters for visual processing, at least a 6.5-fold reduction compared to encoder-based approaches, and runs faster as a result.

Findings

01 Achieves comparable or better performance than encoder-based models on video question answering.

02 Outperforms Video-ChatGPT and Video-LLaVA in accuracy and temporal understanding.

03 Runs 3-4 times faster than previous methods.

Abstract

We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Current video-language models typically rely on heavyweight image encoders (300M-1.1B parameters) or video encoders (1B-1.4B parameters), creating a substantial computational burden when processing multi-frame videos. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders while using only 45M parameters for visual processing - at least a $6.5\times$ reduction compared to traditional approaches. The STAB architecture combines Local Spatio-Temporal Encoding for fine-grained feature extraction, efficient spatial downsampling through learned attention, and separate mechanisms for modeling frame-level and video-level relationships. Our…
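
The reduction factor quoted in the abstract can be checked with simple arithmetic. The sketch below uses only the parameter counts stated above (45M for STAB, 300M-1.1B for image encoders, 1B-1.4B for video encoders); the constant and function names are hypothetical and chosen for illustration, not taken from the paper's code.

```python
# Parameter counts quoted in the abstract, in millions of parameters.
# All names below are illustrative assumptions, not identifiers from the paper.
STAB_VISUAL_PARAMS_M = 45
ENCODER_RANGES_M = {
    "image encoders": (300, 1100),   # 300M-1.1B
    "video encoders": (1000, 1400),  # 1B-1.4B
}


def reduction_factor(baseline_m: float, ours_m: float = STAB_VISUAL_PARAMS_M) -> float:
    """Return how many times smaller `ours_m` is than `baseline_m`."""
    return baseline_m / ours_m


if __name__ == "__main__":
    for label, (low, high) in ENCODER_RANGES_M.items():
        print(
            f"{label}: {reduction_factor(low):.1f}x to "
            f"{reduction_factor(high):.1f}x fewer visual parameters"
        )
```

The smallest baseline in the quoted ranges (a 300M image encoder) gives 300 / 45 ≈ 6.7, which is consistent with the abstract's "at least 6.5" figure; larger encoders give proportionally larger factors.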

Code & Models

Repositories

jh-yi/video-panda
PyTorch · Official

Models

jh-yi/Video-Panda-7B (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition

Methods: Softmax · Attention Is All You Need