ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition

Mohammadreza Salehi; Jae Sung Park; Tanush Yadav; Aditya Kusupati; Ranjay Krishna; Yejin Choi; Hannaneh Hajishirzi; Ali Farhadi

arXiv:2410.05774 · cs.CV · November 12, 2024

TL;DR

ActionAtlas v1.0 is a multiple-choice video question answering benchmark that evaluates how well models recognize subtle, domain-specific actions in sports videos. It exposes the limitations of current models and shows the importance of a high frame sampling rate.

Contribution

The paper introduces ActionAtlas v1.0, a challenging new benchmark for fine-grained action recognition in videos. The benchmark emphasizes subtle, domain-specific movements and is used to evaluate the performance of foundation models.

Findings

01

GPT-4o achieves 45.52% accuracy on ActionAtlas.

02

Non-expert crowd workers achieve 61.64% accuracy.

03

A higher frame sampling rate improves model performance.
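
To make the frame sampling finding concrete, here is a minimal sketch of one common way to pick frames from a short clip: uniform sampling at segment midpoints. This is an illustration only, not the sampling procedure used in the paper; the function name `uniform_frame_indices` and its parameters are hypothetical.

```python
from typing import List


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return frame indices spread evenly across a clip (hypothetical helper).

    The clip is split into `num_samples` equal-width segments and the
    midpoint frame of each segment is selected.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    # Never request more frames than the clip contains.
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# Sampling more frames from the same clip gives denser temporal coverage.
sparse = uniform_frame_indices(num_frames=100, num_samples=4)
dense = uniform_frame_indices(num_frames=100, num_samples=16)
```

With few samples, a brief movement that distinguishes two similar actions can fall entirely between selected frames, which is one plausible reason denser sampling helps on this kind of task.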

Abstract

Our world is full of varied actions and moves across specialized domains that we, as humans, strive to identify and understand. Within any single domain, actions can often appear quite similar, making it challenging for deep models to distinguish them accurately. To evaluate the effectiveness of multimodal foundation models in helping us recognize such actions, we present ActionAtlas v1.0, a multiple-choice video question answering benchmark featuring short videos across various sports. Each video in the dataset is paired with a question and four or five choices. The question pinpoints specific individuals, asking which choice "best" describes their action within a certain temporal context. Overall, the dataset includes 934 videos showcasing 580 unique actions across 56 sports, with a total of 1896 actions within choices. Unlike most existing video question answering benchmarks that…
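
The multiple-choice format described in the abstract can be sketched as a small data structure with an accuracy computation. This is a minimal illustration under stated assumptions: the class `MCQItem`, its field names, and both helper functions are hypothetical and do not reflect the dataset's actual schema or the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MCQItem:
    """One benchmark item (hypothetical schema): a video, a question, 4 or 5 choices."""
    video_id: str
    question: str
    choices: List[str]
    answer_index: int  # position of the correct choice in `choices`


def accuracy(items: List[MCQItem], predictions: List[int]) -> float:
    """Fraction of items whose predicted choice index matches the answer."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(pred == item.answer_index for item, pred in zip(items, predictions))
    return correct / len(items)


def chance_accuracy(items: List[MCQItem]) -> float:
    """Expected accuracy of uniform random guessing over each item's choices."""
    if not items:
        return 0.0
    return sum(1.0 / len(item.choices) for item in items) / len(items)


# Toy items with placeholder text; not taken from the dataset.
items = [
    MCQItem("vid_a", "Which choice best describes the action?", ["a", "b", "c", "d"], 2),
    MCQItem("vid_b", "Which choice best describes the action?", ["a", "b", "c", "d", "e"], 0),
]
model_accuracy = accuracy(items, predictions=[2, 3])
baseline = chance_accuracy(items)
```

Because items have either four or five choices, the random-guess baseline falls between 20% and 25%, which gives context for the reported model and crowd-worker accuracies.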

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications