AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
Madeleine Grunde-McLaughlin, Ranjay Krishna, Maneesh Agrawala

TL;DR
AGQA is a large, balanced benchmark for evaluating models' ability to reason about complex spatio-temporal visual events, revealing significant gaps in current systems' generalization and reasoning capabilities.
Contribution
The paper introduces AGQA, a benchmark for assessing compositional spatio-temporal reasoning in videos. It comprises 192 million unbalanced question-answer pairs plus a balanced subset that minimizes answer bias, and the authors use it to demonstrate the limitations of current models.
Findings
Models perform poorly on AGQA, barely surpassing non-visual baselines.
Existing models do not generalize well to novel compositions.
AGQA reveals significant gaps in current visual reasoning systems.
Abstract
Visual events are a composition of temporal actions involving actors spatially interacting with objects. When developing computer vision models that can reason about compositional spatio-temporal events, we need benchmarks that can analyze progress and uncover shortcomings. Existing video question answering benchmarks are useful, but they often conflate multiple sources of error into one accuracy metric and have strong biases that models can exploit, making it difficult to pinpoint model weaknesses. We present Action Genome Question Answering (AGQA), a new benchmark for compositional spatio-temporal reasoning. AGQA contains 192M unbalanced question answer pairs for videos. We also provide a balanced subset of question answer pairs, orders of magnitude larger than existing benchmarks, that minimizes bias by balancing the answer distributions and types of question…
