Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration
Shaoguang Wang, Weiyu Guo, Ziyang Chen, Yijie Xu, Xuming Hu, Hui Xiong

TL;DR
This paper introduces a token-efficient Video-QA framework combining adaptive frame-pruning and semantic graph integration, significantly reducing token usage while improving accuracy.
Contribution
It proposes a novel refinement framework that addresses visual echoes and enhances selector robustness, outperforming existing methods in token efficiency and accuracy.
Findings
Reduced total input tokens by up to 82.2%
Improved robustness and accuracy of upstream selectors
Achieved state-of-the-art performance on Video-QA benchmarks
Abstract
The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While keyframe selection is the dominant strategy for mitigating this, we identify a critical flaw: even state-of-the-art selectors produce prompts suffering from significant temporal redundancy, a challenge unique to video that we term 'visual echoes'. This issue leads to context dilution and can paradoxically degrade performance. To address this dual challenge, we propose a novel refinement framework that synergistically combines Adaptive Frame-Pruning(AFP) with a lightweight text-based semantic graph. AFP intelligently prunes 'visual echoes' by adaptively clustering frames, while the semantic graph provides crucial, low-cost semantic compensation. Conducting extensive experiments on the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
