Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation
Yudi Shi, Shangzhe Di, Qirui Chen, Weidi Xie

TL;DR
This paper introduces Agent-of-Thoughts Distillation (AoTD), a method that improves VideoQA models by incorporating automatically generated, LLM-verified reasoning chains into instruction tuning, leading to better performance and explainability.
Contribution
The paper proposes AoTD, a new approach that incorporates automatically generated reasoning chains and verification to enhance VideoQA models' reasoning and explainability.
Findings
AoTD improves performance on multiple VideoQA benchmarks.
The method enhances model explainability through reasoning chains.
The verification mechanism increases the reliability of the generated reasoning chains.
Abstract
This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a profound understanding of spatial-temporal dynamics. While large video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we propose Agent-of-Thoughts Distillation (AoTD), a method that enhances models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks and address them with specialized vision models; the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves the performance…
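The pipeline described in the abstract (decompose a question into sub-tasks, solve each with a specialized model, keep the intermediate results as a reasoning chain, then filter with a verifier) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Step` type, the function names, the tool names, and the output dictionary layout are all hypothetical, the specialized vision models and the LLM verifier are represented by plain Python callables, and treating the last step's result as the answer is a simplification made only for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# All names below are hypothetical and exist only for illustration.
Tool = Callable[[str], str]             # stand-in for a specialized vision model
Verifier = Callable[[str, str], bool]   # stand-in for an LLM-based check


@dataclass
class Step:
    tool: str         # key of the tool that handles this sub-task
    query: str        # sub-question passed to the tool
    result: str = ""  # intermediate output, filled in after execution


def run_chain(plan: List[Step], tools: Dict[str, Tool]) -> List[Step]:
    """Run every sub-task with its tool and record the intermediate result."""
    return [Step(s.tool, s.query, tools[s.tool](s.query)) for s in plan]


def chain_to_cot(steps: List[Step]) -> str:
    """Render the executed steps as a textual reasoning chain."""
    return "\n".join(
        f"Step {i}: [{s.tool}] {s.query} -> {s.result}"
        for i, s in enumerate(steps, start=1)
    )


def build_training_example(
    question: str,
    plan: List[Step],
    tools: Dict[str, Tool],
    verifier: Verifier,
) -> Optional[dict]:
    """Return an instruction-tuning record, or None if the chain is rejected."""
    steps = run_chain(plan, tools)
    if not steps:
        return None
    cot = chain_to_cot(steps)
    if not verifier(question, cot):
        return None  # unreliable chains are dropped before tuning
    # Assumption: the last intermediate result serves as the final answer.
    return {"question": question, "cot": cot, "answer": steps[-1].result}
```

In this sketch, a rejected chain simply yields `None`, so only chains that pass the verifier would reach the instruction-tuning set; how the actual system decomposes questions or scores chains is not specified in the text above.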
Taxonomy
Topics: Data Stream Mining Techniques · Data Visualization and Analytics
