ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos
Zhou Yu, Lixiang Zheng, Zhou Zhao, Fei Wu, Jianping Fan, Kui Ren, Jun, Yu

TL;DR
ANetQA is a large-scale benchmark for fine-grained compositional reasoning in video question answering, using automatically generated QA pairs from detailed scene graphs in untrimmed videos, enabling better diagnosis of model capabilities.
Contribution
It introduces ANetQA, a novel benchmark with fine-grained semantics and diverse questions, surpassing previous datasets in scale and detail for VideoQA evaluation.
Findings
Achieves 44.5% accuracy with current models.
Contains 1.4 billion QA pairs, significantly larger than prior benchmarks.
Highlights substantial gap between model performance and human accuracy.
Abstract
Building benchmarks to systemically analyze different capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use non-compositional simple questions and suffer from language biases, making it difficult to diagnose model weaknesses incisively. A recent benchmark AGQA poses a promising paradigm to generate QA pairs automatically from pre-annotated scene graphs, enabling it to measure diverse reasoning abilities with granular control. However, its questions have limitations in reasoning about the fine-grained semantics in videos as such information is absent in its scene graphs. To this end, we present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over the challenging untrimmed videos from ActivityNet. Similar to AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
