VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu

TL;DR
VideoEspresso introduces a large-scale, fine-grained video reasoning dataset with multimodal annotations and a novel framework that enhances reasoning by selecting core frames and leveraging chain-of-thought methods, improving performance on complex VideoQA tasks.
Contribution
The paper presents a new dataset, VideoEspresso, with detailed annotations and reasoning steps, along with a hybrid LVLM framework that improves video reasoning by core frame selection and multimodal CoT reasoning.
Findings
Outperforms existing methods on most tasks in the benchmark
Demonstrates the effectiveness of core frame selection for reasoning
Enables more accurate and detailed video understanding
Abstract
The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to…
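The redundancy-reduction step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes each frame has already been mapped to an embedding vector by some encoder, and it uses a cosine-similarity threshold as a stand-in for the "semantic-aware" criterion. The function names `cosine` and `drop_redundant_frames` and the `threshold` value are hypothetical, chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_frames(embeddings, threshold=0.9):
    """Return indices of frames to keep.

    Assumption (not from the paper): a frame is redundant when its embedding
    is at least `threshold`-similar to the most recently kept frame.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if not kept or cosine(embeddings[kept[-1]], emb) < threshold:
            kept.append(i)
    return kept


# Toy 2-D "embeddings": frames 1 and 3 nearly duplicate frames 0 and 2.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 0.0]]
kept_indices = drop_redundant_frames(frames)
print(kept_indices)
```

Comparing each frame only against the last kept frame keeps the pass linear in the number of frames, at the cost of possibly keeping a frame that resembles a much earlier one (as frame 4 does here).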
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)
