CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling
Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu

TL;DR
CoPE-VideoLM feeds video codec primitives (motion vectors and residuals) to a video language model rather than encoding a full image for most frames. This cuts time-to-first-token by up to 86% and token usage by up to 93% while maintaining or exceeding accuracy on 14 video understanding benchmarks.
Contribution
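The paper presents lightweight transformer-based encoders that aggregate codec primitives (motion vectors and residuals) and align their representations with image encoder embeddings through a pre-training strategy. This alignment accelerates convergence during end-to-end fine-tuning, and avoiding full-image encoding for most frames reduces resource usage in video language models.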
Findings
Reduces time-to-first-token by up to 86%.
Decreases token usage by up to 93%.
Maintains or exceeds performance on 14 video understanding benchmarks.
Abstract
Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window constraint, current methods use keyframe sampling, which often misses both macro-level events and micro-level details because of its sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. We address these limitations by leveraging video codec primitives (specifically, motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach,…
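To make the role of motion vectors and residuals concrete, here is a minimal sketch of generic block-based motion compensation, the mechanism standard codecs use to rebuild a predicted frame from a reference frame. It is an illustration only and not the paper's encoder: the function name `reconstruct_p_frame`, the 2x2 block size, the clamping at frame borders, and all pixel values are hypothetical choices made for this example.

```python
from typing import List, Tuple

Frame = List[List[int]]          # grayscale frame as rows of pixel values
MotionVector = Tuple[int, int]   # (dy, dx) offset into the reference frame


def reconstruct_p_frame(reference: Frame,
                        motion_vectors: List[List[MotionVector]],
                        residual: Frame,
                        block: int = 2) -> Frame:
    """Rebuild a predicted frame: copy shifted blocks of the reference, then add the residual."""
    height, width = len(reference), len(reference[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Every pixel in a block shares that block's motion vector.
            dy, dx = motion_vectors[y // block][x // block]
            # Clamp the source coordinate so off-frame vectors stay valid (an assumption).
            sy = min(max(y + dy, 0), height - 1)
            sx = min(max(x + dx, 0), width - 1)
            out[y][x] = reference[sy][sx] + residual[y][x]
    return out


# Hypothetical 4x4 reference frame.
reference = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]

# One motion vector per 2x2 block; only the top-left block moves.
motion_vectors = [
    [(0, 1), (0, 0)],
    [(0, 0), (0, 0)],
]

# The residual is sparse: a single pixel needs correction.
residual = [[0] * 4 for _ in range(4)]
residual[3][3] = 5

frame = reconstruct_p_frame(reference, motion_vectors, residual)
```

In this toy case the whole frame is described by four motion vectors and one non-zero residual value instead of sixteen pixels, which is the kind of redundancy and sparsity the abstract refers to.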
