FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering

Tatiana Zemskova; Solomon Andryushenko; Ilya Obrubov; Viktoriia Khoruzhaia; Ekaterina Eroshenko; Ekaterina Derevyanka; Dmitry Yudin

arXiv:2603.04349·cs.CV·March 5, 2026

Open Access

TL;DR

FocusGraph is a graph-based frame selection framework for question answering over long egocentric videos. By passing only query-relevant keyframes to the model, it improves accuracy and reduces inference time.

Contribution

The paper presents a framework that combines a lightweight, trainable Scene-Caption LLM Selector with a training-free keyframe selection method for long video understanding.

Findings

01. Achieves state-of-the-art results on egocentric long-video QA benchmarks.

02. Reduces inference time significantly compared to baseline methods.

03. Effectively selects query-relevant keyframes without relying on original frame sequences.

Abstract

The ability to understand long videos is vital for embodied intelligent agents, because their effectiveness depends on how well they can accumulate, organize, and leverage long-horizon perceptual memories. Recently, multimodal LLMs (MLLMs) have been gaining popularity for solving the long video understanding task due to their general ability to understand natural language and to leverage world knowledge. However, as the number of frames provided to an MLLM increases, the quality of its responses tends to degrade, and inference time grows. Therefore, when using MLLMs for long video understanding, a crucial step is selecting key frames from the video to answer user queries. In this work, we develop FocusGraph, a framework for keyframe selection for question answering over long egocentric videos. It leverages a lightweight trainable Scene-Caption LLM Selector that selects query-relevant clips…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition