TL;DR
This paper introduces Deep Video Discovery, an agentic search framework utilizing LLMs and adaptive tool use to improve understanding of long-form videos, achieving state-of-the-art results on benchmark datasets.
Contribution
The paper presents a novel agentic search approach that adaptively orchestrates tools for long video understanding, surpassing previous methods in accuracy.
Findings
Achieves 74.2% accuracy on LVBench dataset.
Improves to 76.0% accuracy with transcripts.
Demonstrates superior performance over prior methods.
Abstract
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Unlike previous video agents that rely on predefined workflows applied uniformly across different queries, our approach emphasizes the autonomous and adaptive nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
MethodsSparse Evolutionary Training
