LongViTU: Instruction Tuning for Long-Form Video Understanding
Rujie Wu, Xiaojian Ma, Hai Ci, Yue Fan, Yuxuan Wang, Haozhe Zhao, Qing Li, Yizhou Wang

TL;DR
LongViTU is a large-scale, high-quality dataset for long-form video understanding that emphasizes long-term context and reasoning, enabling improved evaluation and fine-tuning of video understanding models.
Contribution
The paper introduces LongViTU, a novel dataset with hierarchical QA generation and timestamp annotations, and establishes a benchmark for long-term video comprehension.
Findings
Evaluated models obtain GPT-4 scores of 49.9 and 52.3 on the LongViTU benchmark
Supervised fine-tuning improves model performance by 2.5-3.7%
Human annotators achieve a GPT-4 score of 81.0, indicating the difficulty of the benchmark
Abstract
This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos), automatically generated dataset for long-form video understanding. We propose a systematic approach that organizes videos into a hierarchical tree structure for QA generation and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.). We also offer explicit timestamp annotations of relevant events for each QA pair. We have conducted extensive human studies on LongViTU, and the results prove the quality of our dataset. To better evaluate the challenges posed by LongViTU's emphasis on long-term context and condensed reasoning, we manually curate a subset of LongViTU into a benchmark. Evaluations using a…
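The abstract's main ingredients (a hierarchical video tree, per-question timestamp annotations, and certificate length) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Node` and `QAPair` classes, the level names "video", "event", and "segment", the toy captions, and the reading of certificate length as the total duration of the annotated evidence spans with overlaps counted once. It is not the authors' pipeline or data schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); assumed representation


@dataclass
class Node:
    """One node of a hypothetical video tree (level names are assumptions)."""
    level: str                      # "video", "event", or "segment"
    span: Span
    caption: str = ""
    children: List["Node"] = field(default_factory=list)


@dataclass
class QAPair:
    """A hypothetical QA record with timestamps of the relevant events."""
    question: str
    answer: str
    evidence: List[Span]


def captions_at(node: Node, level: str) -> List[str]:
    """Collect captions of all nodes at the given level, depth-first."""
    found = [node.caption] if node.level == level else []
    for child in node.children:
        found.extend(captions_at(child, level))
    return found


def certificate_length(spans: List[Span]) -> float:
    """Seconds of video covered by the spans, counting overlaps only once."""
    total = 0.0
    current = None
    for start, end in sorted(spans):
        if current is None or start > current[1]:
            # Close the previous merged span and start a new one.
            if current is not None:
                total += current[1] - current[0]
            current = [start, end]
        else:
            # Overlapping span: extend the current merged span.
            current[1] = max(current[1], end)
    if current is not None:
        total += current[1] - current[0]
    return total


# Toy data, invented purely for illustration.
video = Node("video", (0.0, 600.0), "cooking session", [
    Node("event", (0.0, 300.0), "prepare ingredients", [
        Node("segment", (0.0, 60.0), "wash vegetables"),
        Node("segment", (60.0, 300.0), "chop vegetables"),
    ]),
    Node("event", (300.0, 600.0), "cook the dish"),
])

qa = QAPair(
    question="Why were the vegetables washed before cooking?",
    answer="To clean them before they were chopped and cooked.",
    evidence=[(0.0, 60.0), (30.0, 90.0), (300.0, 360.0)],
)

print(captions_at(video, "event"))
print(certificate_length(qa.evidence) / 60.0)  # certificate length in minutes
```

In this sketch, a QA generator would read captions bottom-up (segments, then events, then the whole video), and the evidence spans attached to each QA pair are what make a certificate-length statistic computable.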
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper is easy to follow, and the experiments are clearly described. - The dataset is of high quality, featuring a large number of QA pairs and encompassing a variety of diverse scenarios.
- Figure 1: The icons, while visually appealing, come across as unprofessional and occupy space that could be better utilized to present more information. - Ablation Studies: The paper lacks ablation studies for different-level captions. For instance, it would be beneficial to know if event-level captions can be skipped without significant detriment. - Results: Additional results are necessary to clarify the performance of different Multi-modal Large Language Models (MLLMs) on LongViTU videos wi
1. LongViTU explicitly addresses the limitations of temporal context, length, and fine-grained question types from the perspective of SFT. The hierarchical pipeline for automatic dataset generation is a sound procedure to create long-form annotations from bottom to top. The sheer scale of the dataset (~900 hours of video) and its diversity in terms of scenarios and question types are decent. The use of Ego4D ensures real-world relevance. 2. The paper includes a thorough quantitative evaluation
1. The reliance on LLMs (GPT-4) throughout the pipeline raises concerns about potential biases inherited from the pre-training data of these models. Moreover, a hierarchical pipeline may cause error accumulation, making the bias even worse. A thorough analysis of potential biases in the generated QA pairs is missing. 2. While self-revision is employed, a more robust human evaluation of the dataset quality would strengthen the paper's claims. The current human evaluation seems limited to Appendi
1. The approach of organizing video content into a hierarchical tree structure is innovative. This method allows for the generation of question-answer pairs that capture both spatial and temporal details, which is a creative extension of existing video understanding frameworks. 2. The dataset provides fine-grained categorization of questions, which is crucial for advancing the understanding of complex video content and adds depth to the quality of the dataset.
1. In Table 2, it can be observed that there is a lack of differentiation in the benchmark. The performance gap between the best-performing Gemini-1.5-Pro and the other models is not evident. According to the reviewer, in most existing benchmarks, Gemini-1.5-Pro demonstrates a significant performance advantage over Video-LLaVA. 2. The proposed benchmark employs GPT-4 for assessment, which may introduce additional bias. 3. The validation method employed was released some time ago, and its base
1 Topic is good. Long-form video understanding is a challenging but important problem. Developing a benchmark for instruction tuning and evaluation is critical in this problem. 2 Experiments are sufficient. The experimental studies are interesting to show the challenges and potentials of this benchmark.
1 This benchmark is based on Ego4D. Hence, the annotation would be similar to EgoTaskQA. As shown in Table 1, the difference is the increasing scale of the dataset and the newly-added timestep annotations. Is such timestep annotation important or not? Are there any experimental results to show its impact on your benchmark? 2 The hierarchical video tree style design is similar to [MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding, arXiv:2312.04817]. 3 The paper
Taxonomy
Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Advanced Vision and Imaging
Methods: Attention Is All You Need · Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
