Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark

Rajmund Nagy (1), Hendric Voss (2), Thanh Hoang-Minh (3), Mihail Tsakov (4), Teodor Nikolov (5), Zeyi Zhang (6), Tenglong Ao (6), Sicheng Yang (7), Shaoli Huang (8), Yongkang Cheng (8), M. Hamza Mughal (9), Rishabh Dabral (9), Kiran Chhatre (1), Christian Theobalt (9), Libin Liu (6), Stefan Kopp (2), Rachel McDonnell (10), Michael Neff (11), Taras Kucherenko (12), Youngwoo Yoon (13), Gustav Eje Henter (1, 5)

Affiliations: (1) KTH Royal Institute of Technology; (2) Bielefeld University; (3) University of Science -- VNUHCM; (4) Independent Researcher; (5) Motorica AB; (6) Peking University; (7) Huawei Technologies Ltd.; (8) Astribot; (9) Max-Planck Institute for Informatics, SIC; (10) Trinity College Dublin; (11) University of California, Davis; (12) SEED -- Electronic Arts; (13) Electronics and Telecommunications Research Institute (ETRI)

arXiv:2511.01233·cs.CV·April 23, 2026

TL;DR

This paper highlights the lack of standardised human evaluation practices in speech-driven gesture generation, introduces a detailed protocol for the BEAT2 dataset, and provides comprehensive benchmark results and resources to improve future research.

Contribution

The paper proposes a standardised human evaluation protocol for gesture generation on the BEAT2 dataset, conducts a large-scale crowdsourced benchmark of six recent models, and releases datasets and tools to facilitate consistent future assessments.

Findings

01

On the BEAT2 dataset, motion realism has become a saturated evaluation measure: older models perform on par with newer ones.

02

Claims of high speech-gesture alignment do not hold up under rigorous evaluation.

03

Disentangled assessments of motion quality and alignment are necessary for accurate benchmarking.

Abstract

We review human evaluation practices in automatic, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely-used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results show that 1) motion realism has become a saturated evaluation measure on the BEAT2 dataset, with older models performing on par…
