LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs
Jingfeng Chen, Jiawen Qian, Wendi Deng, Yinuo Guo, Jiaqi Yu, Sicong Leng, Raghuveer Thirukovalluru, Bhuwan Dhingra

TL;DR
LDDR is a novel, training-free frame sampling method for video understanding in multimodal large language models, improving efficiency and accuracy by selecting informative frames under budget constraints.
Contribution
It introduces a query-aware DPP-based sampling framework with dynamic resolution allocation, outperforming existing methods across multiple benchmarks.
Findings
LDDR achieves 3x runtime speedup over standard DPP methods.
It outperforms baselines by 2.5 points under budget constraints.
It improves video understanding across various MLLM backbones.
Abstract
Video understanding in multimodal large language models requires selecting informative frames from long, redundant videos under limited visual-token budgets. Existing methods often rely on uniform sampling, point-wise relevance scoring, chunk-wise selection, or agentic exploration, which either miss global dependencies or introduce substantial overhead. We propose LDDR (Linear DPP-Based Dynamic Resolution), a training-free, plug-and-play, and budget-aware video frame sampling framework. LDDR performs query-aware Determinantal Point Process (DPP) frame selection in a task-conditioned feature space, achieving a 3x runtime speedup over standard DPP baselines. It further introduces a Group DPP importance metric to guide frame retention and dynamic resolution allocation, assigning more tokens to informative, non-redundant frames while downscaling or pruning less useful ones. Across four…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
