TL;DR
Video-ToC introduces a tree-of-cue reasoning framework for video understanding, combining structured visual cues, adaptive reward mechanisms, and new datasets to improve reasoning capabilities in Video LLMs.
Contribution
The paper presents a novel tree-of-cue reasoning approach that combines structured visual cues, adaptive rewards, and new datasets to enhance video understanding in Video LLMs.
Findings
Outperforms baselines on six video understanding benchmarks.
Demonstrates improved reasoning and perception in video analysis.
Achieves superior results on a video hallucination benchmark.
Abstract
Existing Video Large Language Models (Video LLMs) struggle with complex video understanding, exhibiting limited reasoning capabilities and potential hallucinations. In particular, these methods tend to perform reasoning solely relying on the pretrained inherent reasoning rationales whilst lacking perception-aware adaptation to the input video content. To address this, we propose Video-ToC, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. Specifically, our approach introduces three key innovations: (1) A tree-guided visual cue localization mechanism, which endows the model with enhanced fine-grained perceptual capabilities through structured reasoning patterns; (2) A reasoning-demand reward mechanism, which dynamically adjusts the reward value for reinforcement learning (RL) based on the estimation of reasoning demands, enabling…
