EEA: Exploration-Exploitation Agent for Long Video Understanding
Te Yang, Xiangyu Zhu, Bo Wang, Quan Chen, Peng Jiang, Zhen Lei

TL;DR
EEA introduces a semantic-guided hierarchical search framework for long video understanding, balancing exploration and exploitation to improve efficiency and coverage in identifying critical information.
Contribution
The paper presents EEA, a novel agent that dynamically guides exploration in long videos using semantic queries and hierarchical search, enhancing efficiency and accuracy.
Findings
EEA outperforms existing methods on multiple long-video benchmarks.
EEA achieves higher coverage of relevant segments with less computational cost.
EEA demonstrates stable and precise evaluation through adaptive reward combination.
Abstract
Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that achieves an exploration-exploitation balance through semantic guidance and a hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search process, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage within unknown segments.…
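The trade-off described in the abstract, preferring semantically relevant frames while still covering unexplored segments, can be illustrated with a minimal sketch. It is not the paper's algorithm: it swaps in a generic UCB-style exploration bonus, and every name in it (`Segment`, `priority`, `select_segment`, `split`, the `relevance` score, the constant `c`) is a hypothetical chosen for illustration. The relevance values are assumed to come from some external matcher that scores frames against the semantic queries.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """A node in a hypothetical search tree over a video's frame range."""
    start: int          # first frame index (inclusive)
    end: int            # last frame index (exclusive)
    relevance: float    # assumed semantic match score in [0, 1]
    visits: int = 0     # how many times this segment has been expanded


def priority(seg, total_visits, c=0.5):
    """UCB-style score (an assumption, not the paper's rule).

    The relevance term rewards exploitation; the bonus grows for
    segments that have been visited rarely, encouraging coverage.
    """
    bonus = c * math.sqrt(math.log(total_visits + 1) / (seg.visits + 1))
    return seg.relevance + bonus


def select_segment(segments, c=0.5):
    """Pick the segment to expand next instead of expanding uniformly."""
    total = sum(s.visits for s in segments)
    return max(segments, key=lambda s: priority(s, total, c))


def split(seg, left_relevance, right_relevance):
    """Expand a segment into two children at its midpoint."""
    mid = (seg.start + seg.end) // 2
    return [
        Segment(seg.start, mid, left_relevance),
        Segment(mid, seg.end, right_relevance),
    ]


if __name__ == "__main__":
    segments = [
        Segment(0, 100, relevance=0.6, visits=10),   # relevant, well explored
        Segment(100, 200, relevance=0.5, visits=0),  # slightly less relevant, unexplored
    ]
    chosen = select_segment(segments, c=0.5)
    print(f"expand frames [{chosen.start}, {chosen.end})")
```

With `c=0` the selection is purely greedy on relevance; raising `c` shifts the choice toward segments that have not yet been examined, which is the coverage behaviour the abstract refers to.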
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
