VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han; Wei Huang; Hairong Shi; Le Zhuo; Xiu Su; Shifeng Zhang; Xu Zhou; Xiaojuan Qi; Yue Liao; Si Liu

arXiv:2411.14794·cs.CV·November 25, 2024

TL;DR

VideoEspresso introduces a large-scale, fine-grained video reasoning dataset with multimodal annotations and a novel framework that enhances reasoning by selecting core frames and leveraging chain-of-thought methods, improving performance on complex VideoQA tasks.

Contribution

The paper presents a new dataset, VideoEspresso, with detailed annotations and intermediate reasoning steps, along with a hybrid Large Vision Language Model (LVLM) framework that improves video reasoning through core frame selection and multimodal chain-of-thought (CoT) reasoning.

Findings

01 · Outperforms existing methods on most tasks in the benchmark

02 · Demonstrates the effectiveness of core frame selection for reasoning

03 · Enables more accurate and detailed video understanding

Abstract

The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to…
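The redundancy-reduction step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes each frame has already been mapped to a semantic embedding vector (for example, from a caption or image encoder), and it drops any frame whose cosine similarity to the most recently kept frame reaches a threshold. The function names `cosine` and `drop_redundant_frames`, and the `threshold` value of 0.9, are hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_frames(embeddings, threshold=0.9):
    """Return indices of frames to keep.

    Assumption (not from the paper): a frame is redundant when its
    embedding is at least `threshold`-similar to the last kept frame.
    """
    kept = []
    for index, embedding in enumerate(embeddings):
        if not kept or cosine(embeddings[kept[-1]], embedding) < threshold:
            kept.append(index)
    return kept


# Toy 2-D embeddings: frames 1 and 3 nearly repeat the frame before them.
frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
print(drop_redundant_frames(frames))  # [0, 2, 4]
```

Comparing against the last kept frame, rather than the immediately preceding one, keeps slow gradual changes from slipping through unnoticed; whether the paper's method makes the same choice is not stated in the text above.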

Code & Models

Repositories

hshjerry/videoespresso
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)