ExAct: A Video-Language Benchmark for Expert Action Analysis

Han Yi; Yulu Pan; Feihong He; Xinyu Liu; Benjamin Zhang; Oluwatumininu Oguntola; Gedas Bertasius

arXiv:2506.06277 · cs.CV · December 12, 2025

TL;DR

ExAct is a video-language benchmark of 3521 expert-curated multiple-choice questions that tests expert-level understanding of skilled physical activities across six domains. Current VLMs score far below human experts on it.

Contribution

The paper introduces ExAct, a benchmark of expert-curated video question-answer pairs for fine-grained understanding of physical skills, and evaluates recent state-of-the-art VLMs on it, showing a substantial gap relative to human expert performance.

Findings

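01  GPT-4o, the best-performing model evaluated, achieves only 44.70% accuracy on ExAct.

02  Trained human experts attain 82.02% accuracy, about 37 percentage points above the best model.

03  Current VLMs still struggle with the nuanced, fine-grained, expert-level understanding of physical skills that ExAct requires.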

Abstract

We present ExAct, a new video-language benchmark for expert-level understanding of skilled physical human activities. Our new benchmark contains 3521 expert-curated video question-answer pairs spanning 11 physical activities in 6 domains: Sports, Bike Repair, Cooking, Health, Music, and Dance. ExAct requires the correct answer to be selected from five carefully designed candidate options, thus necessitating a nuanced, fine-grained, expert-level understanding of physical human skills. Evaluating the recent state-of-the-art VLMs on ExAct reveals a substantial performance gap relative to human expert performance. Specifically, the best-performing GPT-4o model achieves only 44.70% accuracy, well below the 82.02% attained by trained human specialists/experts. We believe that ExAct will be beneficial for developing and evaluating VLMs capable of precise understanding of human skills in…
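The scoring described in the abstract (one correct answer among five candidates, reported as percentage accuracy) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `accuracy`, the integer encoding of options, and the toy predictions are assumptions made for the example. Only the five-option format and the two reported accuracies come from the abstract.

```python
from typing import Sequence

NUM_OPTIONS = 5  # each ExAct question offers five candidate answers (from the abstract)


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the percentage of questions whose predicted option index
    matches the correct option index (hypothetical helper, 0-based indices)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Expected accuracy of uniform random guessing over five options.
chance_accuracy = 100.0 / NUM_OPTIONS

# Accuracies reported in the abstract, in percent.
gpt4o_accuracy = 44.70
human_expert_accuracy = 82.02

# Gap between human experts and the best model, in percentage points.
gap_points = round(human_expert_accuracy - gpt4o_accuracy, 2)

# Toy usage with made-up predictions: three of four answers are correct.
toy_accuracy = accuracy([0, 2, 4, 1], [0, 2, 3, 1])

print(f"chance: {chance_accuracy:.2f}%  gap: {gap_points:.2f} points  toy: {toy_accuracy:.2f}%")
```

Under these assumptions, uniform guessing over five options would score 20%, so the reported 44.70% for GPT-4o sits above chance but 37.32 percentage points below the human experts.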

Code & Models

Datasets

Alexhimself/ExAct
dataset · 358 downloads

Videos

ExAct: A Video-Language Benchmark for Expert Action Analysis · SlidesLive

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)