FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering
Tatiana Zemskova, Solomon Andryushenko, Ilya Obrubov, Viktoriia Khoruzhaia, Ekaterina Eroshenko, Ekaterina Derevyanka, Dmitry Yudin

TL;DR
FocusGraph is a graph-based keyframe selection framework for question answering over long egocentric videos. By passing only query-relevant keyframes to the multimodal LLM, it aims to improve answer accuracy and reduce inference time.
Contribution
The paper presents a framework that combines a lightweight trainable Scene-Caption LLM Selector, which selects query-relevant clips, with a training-free keyframe selection method for long video understanding.
Findings
Achieves state-of-the-art results on egocentric long-video QA benchmarks.
Reduces inference time significantly compared to baseline methods.
Effectively selects query-relevant keyframes without relying on original frame sequences.
Abstract
The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Recently, multimodal LLMs have been gaining popularity for solving the long video understanding task due to their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade, and inference time grows. Therefore, when using MLLMs for long video understanding, a crucial step is selecting key frames from the video to answer user queries. In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips…
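As a rough illustration of the general keyframe selection step the abstract describes (not FocusGraph's actual method, which is not specified in this summary), the sketch below keeps a fixed budget of the most query-relevant frames and returns them in temporal order. The function name `select_keyframes`, the `budget` parameter, and the relevance scores are all hypothetical; in practice the scores would come from some query-frame relevance model.

```python
from typing import List, Sequence


def select_keyframes(scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring frames, in temporal order.

    `scores[i]` is an assumed query-relevance score for frame i
    (hypothetical input; how it is computed is outside this sketch).
    """
    if budget <= 0:
        return []
    # Rank frame indices by score, highest first; ties go to the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Keep the top `budget` frames, then restore temporal order for the MLLM.
    return sorted(ranked[:budget])


# Made-up relevance scores for an 8-frame video.
scores = [0.10, 0.80, 0.75, 0.05, 0.20, 0.90, 0.15, 0.60]
selected = select_keyframes(scores, budget=3)
print(selected)  # [1, 2, 5]
```

Capping the number of frames this way reflects the motivation stated in the abstract: response quality tends to degrade and inference time grows as more frames are passed to the MLLM.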
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
