LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

Jihwan Kim; Nikhil Parthasarathy; Danfeng Qin; Junhwa Hur; Deqing Sun; Bohyung Han; Ming-Hsuan Yang; Boqing Gong

arXiv:2605.17260 · cs.CV · May 19, 2026

TL;DR

LiteFrame introduces an efficient video encoder for Video LLMs, significantly reducing latency and enabling longer video processing without sacrificing accuracy.

Contribution

The paper proposes LiteFrame, an efficient video encoder backbone for Video LLMs, trained with Compressed Token Distillation (CTD) to reduce latency while improving accuracy.

Findings

01. 35% reduction in end-to-end latency
02. Processes 8× more frames with the same compute
03. Improves average video understanding accuracy

Abstract

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction -- reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision…
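The distillation idea described in the abstract can be illustrated with a toy sketch. The abstract is truncated here and does not specify the compression operator or the loss, so everything below is an assumption made for illustration: the teacher's per-frame tokens are compressed by plain average pooling over time and space, and the student is scored with a mean-squared error against those compressed tokens. The function names (`compress_tokens`, `distillation_loss`) and the strides are hypothetical and are not taken from the paper or its repository.

```python
# Toy sketch of distilling toward spatio-temporally compressed teacher tokens.
# ASSUMPTIONS (not from the paper): average pooling as the compressor,
# mean-squared error as the objective, and all names and sizes below.

def compress_tokens(feats, t_stride=2, s_stride=2):
    """Average-pool a nested list shaped (T, H, W, D) over time/space blocks."""
    T, H, W = len(feats), len(feats[0]), len(feats[0][0])
    D = len(feats[0][0][0])
    if T % t_stride or H % s_stride or W % s_stride:
        raise ValueError("T, H and W must be divisible by their strides")
    out = []
    for t0 in range(0, T, t_stride):
        plane = []
        for h0 in range(0, H, s_stride):
            row = []
            for w0 in range(0, W, s_stride):
                # Gather every token vector in this time/space block.
                block = [feats[t][h][w]
                         for t in range(t0, t0 + t_stride)
                         for h in range(h0, h0 + s_stride)
                         for w in range(w0, w0 + s_stride)]
                row.append([sum(v[d] for v in block) / len(block)
                            for d in range(D)])
            plane.append(row)
        out.append(plane)
    return out


def flatten(tokens):
    """Flatten a (T, H, W, D) nested list into a flat list of floats."""
    return [x for plane in tokens for row in plane for vec in row for x in vec]


def distillation_loss(student_tokens, teacher_tokens):
    """Mean-squared error between student predictions and teacher targets."""
    s, t = flatten(student_tokens), flatten(teacher_tokens)
    if len(s) != len(t):
        raise ValueError("student and teacher token shapes differ")
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)


# Hypothetical teacher output: 4 frames of 4x4 tokens, 2 channels each.
teacher_feats = [[[[float(t), float(h + w)] for w in range(4)]
                  for h in range(4)] for t in range(4)]

# 64 teacher tokens become 8 compressed target tokens.
target = compress_tokens(teacher_feats)

# Stand-in for a student encoder's output: the target shifted by 0.1.
student_pred = [[[[x + 0.1 for x in vec] for vec in row]
                 for row in plane] for plane in target]

loss = distillation_loss(student_pred, target)
```

In this sketch the student is asked to emit the compressed token grid directly, so a per-frame full-resolution token set never has to be produced on the student side. How LiteFrame actually defines the compressed targets and the training objective should be taken from the paper itself.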

Code & Models

Repositories

jjihwan/LiteFrame (GitHub)