ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos

Zhou Yu; Lixiang Zheng; Zhou Zhao; Fei Wu; Jianping Fan; Kui Ren; Jun Yu

arXiv:2305.02519·cs.CV·May 5, 2023·1 cites

TL;DR

ANetQA is a large-scale benchmark for fine-grained compositional reasoning in video question answering, using automatically generated QA pairs from detailed scene graphs in untrimmed videos, enabling better diagnosis of model capabilities.

Contribution

It introduces ANetQA, a novel benchmark with fine-grained semantics and diverse questions, surpassing previous datasets in scale and detail for VideoQA evaluation.

Findings

01

The best evaluated model achieves 44.5% accuracy.

02

Contains 1.4 billion unbalanced QA pairs, significantly more than prior benchmarks.

03

Highlights a substantial gap between model performance and human accuracy.

Abstract

Building benchmarks to systematically analyze different capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use simple, non-compositional questions and suffer from language biases, making it difficult to diagnose model weaknesses incisively. A recent benchmark, AGQA, offers a promising paradigm: it generates QA pairs automatically from pre-annotated scene graphs, which lets it measure diverse reasoning abilities with granular control. However, its questions are limited in reasoning about the fine-grained semantics in videos, as such information is absent from its scene graphs. To this end, we present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over the challenging untrimmed videos from ActivityNet. Similar to AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene…

Code & Models

Repositories

MILVLG/anetqa-code
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning