CinePile: A Long Video Question Answering Dataset and Benchmark
Ruchit Rawal, Khalid Saifullah, Miquel Farré, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein

TL;DR
CinePile introduces a large-scale, challenging dataset for long-form video question answering, emphasizing genuine comprehension over superficial analysis, and evaluates current models' performance on this benchmark.
Contribution
The paper presents CinePile, a novel dataset with 305,000 questions for authentic long-video understanding, and benchmarks models' capabilities in this domain.
Findings
Models underperform humans on CinePile tasks; one review cites 73.21% human accuracy against 60.12% for the best model, Gemini 1.5 Pro.
Fine-tuning improves model performance significantly.
The dataset covers diverse multimodal and temporal reasoning aspects.
Abstract
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we fine-tuned open-source Video-LLMs on the training split…
Peer Reviews
Decision: Submitted to ICLR 2025
- Scale and Efficiency: The paper presents an innovative automated pipeline that created 305,000 high-quality QA pairs, which is 70-75x larger than comparable datasets like MoVQA (21,953) or Video-MME (2,700).
- Clear Benchmark Value: The significant performance gap between humans (73.21%) and best models (60.12% for Gemini 1.5 Pro) demonstrates that this dataset effectively challenges current state-of-the-art systems and provides meaningful room for improvement.
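The figures quoted in this review can be checked with a few lines of arithmetic. The sketch below uses only the numbers as stated above (the constant names are illustrative, not from the paper). Taken at face value, the quoted counts give ratios of roughly 14x and 113x rather than 70-75x, so the reviewer may have been comparing on a different basis.

```python
# Numbers copied from the review text above; names are illustrative.
CINEPILE_QA = 305_000
MOVQA_QA = 21_953
VIDEO_MME_QA = 2_700
HUMAN_ACC = 73.21        # percent
BEST_MODEL_ACC = 60.12   # percent, Gemini 1.5 Pro as quoted


def scale_ratio(larger: int, smaller: int) -> float:
    """How many times bigger `larger` is than `smaller`."""
    return larger / smaller


ratio_movqa = scale_ratio(CINEPILE_QA, MOVQA_QA)
ratio_video_mme = scale_ratio(CINEPILE_QA, VIDEO_MME_QA)
gap_points = HUMAN_ACC - BEST_MODEL_ACC

print(f"CinePile vs MoVQA:     {ratio_movqa:.1f}x")
print(f"CinePile vs Video-MME: {ratio_video_mme:.1f}x")
print(f"Human-model gap:       {gap_points:.2f} percentage points")
```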
- The paper lacks rigorous statistical analysis of the dataset's properties - for example, there are no inter-annotator agreement scores for human evaluation and no analysis of the language content of the answers or questions. For instance, with LLM-generated answers, the correct answer is often longer than the wrong answers, which becomes an issue.
- The "Setting and Technical Analysis" (STA) questions may not truly test long-form understanding since they could potentially be answered by analyzing
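The answer-length concern raised in this review can be measured directly. The following is a minimal sketch, not the authors' analysis: the `MCQ` record, its field names, and the toy items are all hypothetical and do not reflect the actual CinePile schema. It reports how often the correct option is the uniquely longest one, alongside a rough chance baseline of one over the number of options.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MCQ:
    """Hypothetical multiple-choice record; not the CinePile schema."""
    question: str
    options: List[str]
    answer_index: int


def correct_is_uniquely_longest(item: MCQ) -> bool:
    """True if the correct option has strictly more words than every other option."""
    lengths = [len(opt.split()) for opt in item.options]
    longest = max(lengths)
    return lengths[item.answer_index] == longest and lengths.count(longest) == 1


def length_bias_rate(items: List[MCQ]) -> float:
    """Fraction of items whose correct answer is the uniquely longest option."""
    if not items:
        return 0.0
    return sum(correct_is_uniquely_longest(i) for i in items) / len(items)


def chance_rate(items: List[MCQ]) -> float:
    """Rough baseline: mean of 1/num_options, assuming length is unrelated to correctness."""
    if not items:
        return 0.0
    return sum(1 / len(i.options) for i in items) / len(items)


# Toy data, invented for illustration only.
toy_items = [
    MCQ("What does he do?",
        ["A red car", "A man opening the door slowly", "A dog", "A hat"], 1),
    MCQ("What happens next?",
        ["He leaves", "She arrives at the station early", "They argue", "Nobody speaks"], 0),
    MCQ("Where is the scene set?",
        ["In the kitchen at night", "Outside", "On a boat", "At school"], 0),
]

print(f"observed: {length_bias_rate(toy_items):.2f}, chance: {chance_rate(toy_items):.2f}")
```

An observed rate well above the chance baseline would suggest that a model could exploit option length without watching the video at all.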
1. The paper introduces a detailed and effective pipeline for generating a high-quality question-answering dataset, CinePile, which is specifically tailored for long-form video understanding.
2. To ensure the quality of the automatically labeled data, the authors have developed an adversarial refinement process that rigorously evaluates and improves the dataset's accuracy.
3. Additionally, the paper includes an extensive testing suite that assesses the performance of the latest open-source m
1. The authors' dataset statistics table compares CinePile with a variety of VideoQA datasets. However, there are two minor issues. Firstly, CinePile predominantly consists of movie clips, and a separate analysis comparing it with other movie QA datasets would be beneficial for a more targeted evaluation. Secondly, the videos in CinePile are only 160 seconds long, which doesn't offer a significant advantage in terms of length over other datasets.
2. While the authors' use of templates to constr
1. This paper gives a large-scale and challenging long-form video VQA dataset. The automated pipeline for question generation is a major strength. The use of readily available ADs and LLMs offers a cost-effective and scalable approach to annotate long videos. Also, the paper includes detailed analyses of various aspects of the dataset, including question type distribution, vision reliance, and question difficulty. The authors demonstrate a deep understanding of the challenges involved in crea
1. The description of the adversarial refinement process could be more detailed. Specific examples of how questions were modified and the criteria for determining success would be beneficial.
2. While the authors acknowledge potential biases in the LLMs used for question generation and the geographical limitations of the movie clips, a more in-depth discussion of potential biases (e.g. countries, languages, etc.) in the dataset itself and mitigation strategies would strengthen the paper. How we
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
