Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

Shaoguang Wang; Weiyu Guo; Ziyang Chen; Yijie Xu; Xuming Hu; Hui Xiong

arXiv:2508.03337 · cs.CV · April 22, 2026

TL;DR

This paper introduces a token-efficient Video-QA framework that combines Adaptive Frame-Pruning (AFP) with a lightweight text-based semantic graph, cutting total input tokens by up to 82.2% while improving accuracy.

Contribution

It proposes a refinement framework that prunes 'visual echoes' (temporally redundant frames that keyframe selectors leave in the prompt) and makes upstream selectors more robust, outperforming existing methods in both token efficiency and accuracy.

Findings

01. Reduced total input tokens by up to 82.2%

02. Improved robustness and accuracy of upstream selectors

03. Achieved state-of-the-art performance on Video-QA benchmarks

Abstract

The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While keyframe selection is the dominant strategy for mitigating this, we identify a critical flaw: even state-of-the-art selectors produce prompts suffering from significant temporal redundancy, a challenge unique to video that we term 'visual echoes'. This issue leads to context dilution and can paradoxically degrade performance. To address this dual challenge, we propose a novel refinement framework that synergistically combines Adaptive Frame-Pruning (AFP) with a lightweight text-based semantic graph. AFP intelligently prunes 'visual echoes' by adaptively clustering frames, while the semantic graph provides crucial, low-cost semantic compensation. Conducting extensive experiments on the…
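The abstract says AFP prunes 'visual echoes' by adaptively clustering frames but does not describe the clustering procedure. The Python sketch below shows one way such pruning could work, under stated assumptions; it is not the authors' implementation. It assumes each selected keyframe already has an embedding vector, groups temporally adjacent frames whose cosine similarity stays above a threshold derived from the video's own similarity statistics, and keeps one representative per group. The function name `prune_visual_echoes`, the `sensitivity` parameter, and the mean-minus-standard-deviation threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def prune_visual_echoes(features: np.ndarray, sensitivity: float = 1.0) -> list[int]:
    """Return one representative frame index per cluster of similar frames.

    features:    (n_frames, dim) per-frame embeddings in temporal order.
    sensitivity: hypothetical knob; larger values cut less often and prune more.
    """
    n = len(features)
    if n == 0:
        return []
    if n == 1:
        return [0]

    # L2-normalise so that a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    # Cosine similarity between each frame and its successor.
    sims = np.sum(unit[:-1] * unit[1:], axis=1)

    # Assumed adaptive rule: a cluster ends where similarity drops well
    # below what is typical for this particular set of frames.
    threshold = sims.mean() - sensitivity * sims.std()
    # The small tolerance keeps rounding noise from splitting identical frames.
    boundaries = np.flatnonzero(sims < threshold - 1e-9) + 1

    kept = []
    for cluster in np.split(np.arange(n), boundaries):
        # Keep the frame closest to the cluster centroid as its representative.
        centroid = unit[cluster].mean(axis=0)
        best = cluster[np.argmax(unit[cluster] @ centroid)]
        kept.append(int(best))
    return kept


if __name__ == "__main__":
    # Toy embeddings: three near-identical frames followed by two others.
    frames = np.array([[1, 0], [1, 0.01], [1, 0.02], [0, 1], [0.01, 1]], dtype=float)
    print(prune_visual_echoes(frames))
```

Because the threshold is relative to each video's own statistics, a static clip collapses to very few frames while a fast-cutting clip keeps more. A video that changes slowly and uniformly would need a different rule, which this sketch does not attempt.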
