Progressive Video Condensation with MLLM Agent for Long-form Video Understanding
Yufei Yin, Yuchen Xing, Qianke Meng, Minghao Chen, Yan Yang, Zhou Yu

TL;DR
ProVCA is a progressive video condensation method that efficiently identifies keyframes for long video understanding, improving zero-shot accuracy while reducing computational load.
Contribution
It introduces a multi-granularity, iterative approach to locate relevant video segments and keyframes for effective MLLM-based reasoning in long videos.
Findings
Achieves state-of-the-art zero-shot accuracy on EgoSchema, NExT-QA, and IntentQA datasets.
Uses fewer frames than previous training-free methods, demonstrating efficiency.
Employs a progressive narrowing approach from coarse segments to fine keyframes.
Abstract
Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can keep visual details but are too frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes in those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a…
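The coarse-to-fine narrowing described in the abstract (segment, then snippets, then keyframes) can be illustrated with a minimal sketch. This is not the authors' implementation: ProVCA's modules rely on an MLLM agent, whereas the sketch below scores every stage with plain cosine similarity between hypothetical per-frame embeddings and a query embedding. The function name `condense` and the parameters `n_segments`, `snippet_len`, `top_snippets`, and `frames_per_snippet` are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(xs):
    return sum(xs) / len(xs)


def condense(frame_emb, query_emb, n_segments=4, snippet_len=8,
             top_snippets=2, frames_per_snippet=1):
    """Return sorted keyframe indices chosen by coarse-to-fine narrowing.

    Assumption: similarity to the query stands in for the MLLM-based
    decisions at every stage; all parameter names are hypothetical.
    """
    if not frame_emb:
        return []
    sims = [cosine(f, query_emb) for f in frame_emb]
    n = len(sims)

    # Stage 1 (coarse): split the video into equal segments and keep the
    # one whose frames are, on average, most similar to the query.
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    segments = [(bounds[i], bounds[i + 1]) for i in range(n_segments)
                if bounds[i] < bounds[i + 1]]
    start, end = max(segments, key=lambda se: mean(sims[se[0]:se[1]]))

    # Stage 2 (medium): cut the chosen segment into short snippets and
    # keep the top-scoring ones.
    snippets = [(s, min(s + snippet_len, end))
                for s in range(start, end, snippet_len)]
    snippets.sort(key=lambda se: mean(sims[se[0]:se[1]]), reverse=True)

    # Stage 3 (fine): within each kept snippet, pick the best frames.
    keyframes = []
    for s, e in snippets[:top_snippets]:
        ranked = sorted(range(s, e), key=lambda i: sims[i], reverse=True)
        keyframes.extend(ranked[:frames_per_snippet])
    return sorted(keyframes)


# Toy video: 64 frames, only frames 20 and 27 match the query direction.
frames = [[0.0, 1.0] for _ in range(64)]
frames[20] = [1.0, 0.0]
frames[27] = [1.0, 0.0]
print(condense(frames, [1.0, 0.0]))  # [20, 27]
```

In this toy run only 2 of 64 frames survive, which mirrors the efficiency argument: a downstream model would need to look at a small fraction of the video. A real system would replace the similarity scores in stages 1 and 3 with model-driven judgments.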
