VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal

TL;DR
VideoTree introduces a query-adaptive, hierarchical video representation framework that enhances large language model reasoning on long videos by efficiently extracting relevant information without additional training.
Contribution
It proposes a training-free, hierarchical, query-adaptive video representation method that improves reasoning accuracy and efficiency on long videos for LLMs.
Findings
Outperforms existing training-free methods on EgoSchema and NExT-QA datasets.
Achieves higher accuracy than GPT-4V on long video data.
Reduces inference time while maintaining or improving performance.
Abstract
Long-form video understanding is complicated by the high redundancy of video data and the abundance of query-irrelevant information. To tackle these challenges, we propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation for LLM reasoning over long-form videos. First, VideoTree extracts query-relevant information from the input video through an iterative process, progressively refining the selection of keyframes based on their relevance to the query. Furthermore, VideoTree leverages the inherent hierarchical structure of long video data, which is often overlooked by existing LLM-based methods. Specifically, we incorporate multi-granularity information into a tree-based representation, allowing VideoTree to extract query-relevant details from long videos in a coarse-to-fine manner. This enables the model to effectively handle a wide…
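The coarse-to-fine, query-adaptive idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps in uniform temporal splitting for whatever grouping the paper uses, and `relevance` is a hypothetical callback standing in for any query-relevance scorer. All names (`Node`, `expand`, `build_video_tree`, `collect_keyframes`) and parameters (`coarse_segments`, `branching`, `threshold`, `max_depth`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    start: int       # first frame index covered (inclusive)
    end: int         # last frame index covered (exclusive)
    keyframe: int    # representative frame for this span (here: the midpoint)
    score: float     # query relevance of the keyframe
    children: List["Node"] = field(default_factory=list)


def expand(start: int, end: int, relevance: Callable[[int], float],
           branching: int, threshold: float, depth_left: int) -> Node:
    """Score one span and split it further only if it looks query-relevant."""
    keyframe = (start + end) // 2
    node = Node(start, end, keyframe, relevance(keyframe))
    can_split = end - start >= branching
    if depth_left > 0 and node.score >= threshold and can_split:
        # Assumption: uniform temporal splits stand in for a real grouping step.
        bounds = [start + (end - start) * i // branching
                  for i in range(branching + 1)]
        node.children = [
            expand(lo, hi, relevance, branching, threshold, depth_left - 1)
            for lo, hi in zip(bounds, bounds[1:])
        ]
    return node


def build_video_tree(num_frames: int, relevance: Callable[[int], float],
                     coarse_segments: int = 4, branching: int = 2,
                     threshold: float = 0.5, max_depth: int = 2) -> List[Node]:
    """Build one subtree per coarse segment; relevant segments grow deeper."""
    bounds = [num_frames * i // coarse_segments
              for i in range(coarse_segments + 1)]
    return [expand(lo, hi, relevance, branching, threshold, max_depth)
            for lo, hi in zip(bounds, bounds[1:])]


def collect_keyframes(roots: List[Node]) -> List[int]:
    """Gather the keyframes of every node, in temporal order."""
    frames = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        frames.add(node.keyframe)
        stack.extend(node.children)
    return sorted(frames)
```

With a scorer that marks only one stretch of the video as relevant, the irrelevant coarse segments each contribute a single keyframe while the relevant stretch is sampled more densely, which is the behaviour the abstract describes at a high level.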
Taxonomy
Topics: Video Analysis and Summarization · Natural Language Processing Techniques · Multimedia Communication and Technology
Methods: Sparse Evolutionary Training
