Enhancing Long Video Question Answering with Scene-Localized Frame Grouping

Xuyi Yang; Wenhao Zhang; Hongbo Jin; Lin Liu; Hongbo Xu; Yongwei Nie; Fei Yu; Fei Ma

arXiv:2508.03009 · cs.CV · August 6, 2025

TL;DR

This paper introduces SceneQA, a new long video question-answering scenario, along with the LVSQA dataset, and proposes SLFG, a scene-localized frame grouping method that improves multimodal large language models' understanding of long videos without altering their architecture.

Contribution

The paper presents the SceneQA scenario and the LVSQA dataset for evaluating scene-based reasoning in long videos, and introduces SLFG, a scene-localized frame grouping technique that improves MLLMs' long video understanding without modifying their architecture.

Findings

1. SLFG significantly improves long video understanding on benchmarks.

2. LVSQA provides a fair evaluation of scene perception abilities.

3. The approach requires no modification to existing models.

Abstract

Current Multimodal Large Language Models (MLLMs) often perform poorly on long video understanding, primarily because resource limitations prevent them from processing all video frames and their associated information, which makes efficiently extracting the relevant information challenging. Existing frameworks and evaluation tasks focus on identifying specific frames containing core objects among a large number of irrelevant frames, which does not align with the practical needs of real-world applications. To address this issue, we propose SceneQA, a new scenario under the video question-answering task that emphasizes scene-based detail perception and reasoning. To support the SceneQA task, we develop the LVSQA dataset, which is built upon carefully selected videos from LVBench and contains a new collection of question-answer pairs to promote a fairer evaluation of…
