Neptune: The Long Orbit to Benchmarking Long Video Understanding
Arsha Nagrani, Mingda Zhang, Ramin Mehran, Rachel Hornung, Nitesh Bharadwaj Gundavarapu, Nilpa Jha, Austin Myers, Xingyi Zhou, Boqing Gong, Cordelia Schmid, Mikhail Sirotenko, Yukun Zhu, Tobias Weyand

TL;DR
Neptune is a new benchmark dataset designed to evaluate long video understanding, emphasizing reasoning over extended time periods and multiple modalities, to advance the development of more capable models.
Contribution
We introduce Neptune, a dataset built with a scalable pipeline that uses large models to generate dense captions and questions, and GEM, a new metric for evaluating open-ended responses in long video understanding.
Findings
Current models perform poorly on Neptune, especially on temporal reasoning tasks.
Neptune covers diverse long video reasoning abilities, including multimodal reasoning.
The dataset and metric facilitate the development of more advanced long video understanding models.
Abstract
We introduce Neptune, a benchmark for long video understanding that requires reasoning over long time horizons and across different modalities. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. In order to mitigate both these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs), to automatically generate dense, time-aligned video captions, as well as tough question answer decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and consists of a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question…
Peer Reviews
Decision · Submitted to ICLR 2025
1. The semi-automated annotation method proposed in this paper overcomes some of the challenges associated with manual annotation, such as the creation of complex temporal reasoning questions, which can be laborious for humans. The utilization of GPT to generate these questions reduces this burden. 2. The introduction of the GEM metric addresses the need for a static, open-source evaluation metric for open-ended VideoQA, which has been a limitation in prior work. 3. The paper clearly articulate
1. The key contributions that distinguish it from previous benchmarks are not clearly highlighted. Although the paper has made a simple comparison with EgoSchema on the impact of the number of input frames on performance, I feel that it is too simplistic. It would be constructive to see more comparisons with other existing datasets (e.g., VideoMME) to understand how NEPTUNE's complexity and diversity align or differ from them, which could highlight the unique challenges it presents. 2. In additi
1. Neptune addresses several essential question types related to temporal-aware video understanding, including the challenging temporal ordering and counting. 2. Two subsets are introduced to comprehensively assess current multi-modal large language models.
1. In Figure 4, the authors show that EgoSchema reaches saturation at approximately 16 frames, while performance continues to increase with Neptune. This conclusion is drawn based on the powerful Gemini model; it would be beneficial to additionally include results from some open-source models (e.g., short or long context MLLMs) to better promote the development of open-source MLLMs. 2. Model names should be consistently formatted (e.g., VideoLLaMA2 vs. VideoLlaMA2).
1. The paper introduces an innovative semi-automatic pipeline designed to generate question-answer-decoy (QAD) sets. This method effectively reduces the annotation costs associated with manual video captioning and question generation. By leveraging large models such as VLMs and LLMs, the pipeline automates the creation of dense, time-aligned video captions, which are then used to derive challenging QAD sets for segments of video content. This approach not only scales well but also maintains a hi
1. The paper does not include an evaluation and comparison with the latest open source models such as InternVL, LLaVA-OneVision, and MiniCPM. These models are part of the current research landscape and offer a different perspective on video understanding capabilities. 2. The paper primarily focuses on the analysis of benchmarks like NextQA and EgoSchema but does not provide a thorough comparison with more recent benchmarks such as MLVU, Video-MME, and LongVideoBench, which are designed to evalu
Taxonomy
Topics: Scientific Computing and Data Management · CCD and CMOS Imaging Sensors · Parallel Computing and Optimization Techniques
