CoPE-VideoLM: Leveraging Codec Primitives For Efficient Video Language Modeling

Sayan Deb Sarkar; Rémi Pautrat; Ondrej Miksik; Marc Pollefeys; Iro Armeni; Mahdi Rad; Mihai Dusmanu

arXiv:2602.13191 · cs.CV · March 31, 2026

TL;DR

CoPE-VideoLM represents most video frames through codec primitives (motion vectors and residuals) rather than full-image encodings. This reduces time-to-first-token by up to 86% and token usage by up to 93% while maintaining or exceeding performance on 14 video understanding benchmarks.

Contribution

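The paper introduces lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy, which accelerates convergence during end-to-end fine-tuning and avoids expensive full-image encoding for most frames.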

Findings

1. Reduces time-to-first-token by up to 86%.

2. Decreases token usage by up to 93%.

3. Maintains or exceeds performance on 14 video understanding benchmarks.
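
To put the "up to" percentages in concrete terms, the short sketch below applies them to a hypothetical baseline. The baseline figures (64 sampled frames at 196 tokens each, and a 2000 ms time-to-first-token) are assumptions chosen only for illustration; they do not come from the paper, and `remaining_after_reduction` is a helper name invented here.

```python
# Illustrative arithmetic only. The baseline numbers are hypothetical,
# not measurements reported by the paper.

def remaining_after_reduction(baseline, percent_reduction):
    """Return what is left of `baseline` after cutting it by `percent_reduction` percent."""
    return baseline * (100 - percent_reduction) / 100


baseline_tokens = 64 * 196   # assumed: 64 frames x 196 visual tokens per frame
baseline_ttft_ms = 2000      # assumed: 2 s time-to-first-token

# Best-case reductions quoted in the findings: 93% fewer tokens, 86% lower TTFT.
tokens_left = remaining_after_reduction(baseline_tokens, 93)
ttft_left_ms = remaining_after_reduction(baseline_ttft_ms, 86)

print(f"tokens: {baseline_tokens} -> {tokens_left:.0f}")
print(f"time-to-first-token: {baseline_ttft_ms} ms -> {ttft_left_ms:.0f} ms")
```

Because both figures are stated as upper bounds, the savings for a given video and configuration may be smaller than this best case.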

Abstract

Video Language Models (VideoLMs) enable AI systems to understand temporal dynamics in videos. To fit within the maximum context window constraint, current methods use keyframe sampling, which often misses both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. We address these limitations by leveraging video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach,…
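
For readers unfamiliar with codec primitives, the following is a minimal sketch of how a block-based codec rebuilds a predicted frame from a reference frame, motion vectors, and a residual. It illustrates generic motion compensation under simplifying assumptions (integer-pixel vectors, square blocks, a single reference frame) and is not the paper's encoder; the function name, data layout, and toy values are all hypothetical.

```python
# Illustrative sketch only: generic block-based motion compensation,
# not the encoders described in the paper. All names are hypothetical.

def reconstruct_p_frame(reference, motion_vectors, residual, block=2):
    """Rebuild a predicted frame from a reference frame plus codec primitives.

    reference      : 2D list (H x W) of pixel intensities, the previous decoded frame
    motion_vectors : dict mapping (block_row, block_col) -> (dy, dx) offset into reference
    residual       : 2D list (H x W) of per-pixel corrections
    block          : block size in pixels (assumed square; H and W divisible by it)
    """
    height, width = len(reference), len(reference[0])
    frame = [[0] * width for _ in range(height)]
    for by in range(0, height, block):
        for bx in range(0, width, block):
            # Blocks without an entry are treated as static (zero motion).
            dy, dx = motion_vectors.get((by // block, bx // block), (0, 0))
            for y in range(by, by + block):
                for x in range(bx, bx + block):
                    # Clamp source coordinates so vectors pointing outside the frame stay valid.
                    sy = min(max(y + dy, 0), height - 1)
                    sx = min(max(x + dx, 0), width - 1)
                    frame[y][x] = reference[sy][sx] + residual[y][x]
    return frame


reference = [
    [10, 10, 50, 50],
    [10, 10, 50, 50],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
# The top-left block copies its content from two pixels to the right.
motion_vectors = {(0, 0): (0, 2)}
residual = [[0] * 4 for _ in range(4)]
residual[0][0] = 3  # a small correction the motion vector alone cannot explain

p_frame = reconstruct_p_frame(reference, motion_vectors, residual)
for row in p_frame:
    print(row)
```

In this toy case the whole frame change is described by one motion vector and one nonzero residual value, which is the kind of redundancy and sparsity the abstract refers to.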
