Can Vision Language Models Judge Action Quality? An Empirical Evaluation
Miguel Monte e Freitas, Rui Henriques, Ricardo Rei, Pedro Henrique Martins

TL;DR
This paper evaluates the performance of state-of-the-art Vision Language Models in Action Quality Assessment across various domains, revealing significant limitations and biases that hinder reliable fine-grained movement evaluation.
Contribution
The study provides a comprehensive empirical evaluation of Vision Language Models (VLMs) in Action Quality Assessment (AQA), shows that they perform only marginally above chance and exhibit systematic biases, and establishes a baseline for future research.
Findings
VLMs perform only marginally above chance in AQA tasks.
Adding skeleton information, grounding instructions, reasoning structures, or in-context learning yields isolated gains, but none is consistently effective.
Models exhibit biases towards predicting correct execution regardless of visual evidence.
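
The findings above rest on two kinds of measurement: accuracy relative to a chance-level baseline, and how skewed the predicted labels are compared with the ground truth. The sketch below illustrates one way such numbers could be computed for a binary correct/incorrect task. It is not the authors' evaluation code; the function name `bias_report`, the label strings, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def bias_report(y_true, y_pred, positive="correct"):
    """Summarise accuracy against simple baselines and the skew toward `positive`.

    Hypothetical helper for illustration; label names are assumptions.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n = len(y_true)
    pairs = list(zip(y_true, y_pred))

    # Plain accuracy: fraction of predictions that match the ground truth.
    accuracy = sum(t == p for t, p in pairs) / n

    # Majority-class baseline: accuracy of always predicting the most common label.
    majority_baseline = Counter(y_true).most_common(1)[0][1] / n

    # Balanced accuracy: mean per-class recall, which a constant predictor
    # cannot push above 1 / number_of_classes.
    recalls = []
    for label in sorted(set(y_true)):
        support = sum(t == label for t in y_true)
        hits = sum(t == label and p == label for t, p in pairs)
        recalls.append(hits / support)
    balanced_accuracy = sum(recalls) / len(recalls)

    # Bias indicator: how often the model says `positive` versus how often
    # `positive` is actually the right answer.
    predicted_positive_rate = sum(p == positive for p in y_pred) / n
    true_positive_rate = sum(t == positive for t in y_true) / n

    return {
        "accuracy": accuracy,
        "majority_baseline": majority_baseline,
        "balanced_accuracy": balanced_accuracy,
        "predicted_positive_rate": predicted_positive_rate,
        "true_positive_rate": true_positive_rate,
    }


# Toy data (made up, not results from the paper): the model says "correct"
# for five of six clips although only three executions are actually correct.
y_true = ["correct", "incorrect", "correct", "incorrect", "incorrect", "correct"]
y_pred = ["correct", "correct", "correct", "correct", "incorrect", "correct"]
report = bias_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the predicted rate of "correct" is well above the true rate, which is the pattern the third finding describes. Balanced accuracy is reported alongside plain accuracy because a model that always answers "correct" can look acceptable on plain accuracy when correct executions dominate the data.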
Abstract
Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporation of skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence,…
