Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark
Rajmund Nagy (1), Hendric Voss (2), Thanh Hoang-Minh (3), Mihail Tsakov (4), Teodor Nikolov (5), Zeyi Zhang (6), Tenglong Ao (6), Sicheng Yang (7), Shaoli Huang (8), Yongkang Cheng (8), M. Hamza Mughal (9), Rishabh Dabral (9), Kiran Chhatre (1), Christian Theobalt (9)

TL;DR
This paper highlights the lack of standardised human evaluation practices in speech-driven gesture generation, introduces a detailed protocol for the BEAT2 dataset, and provides comprehensive benchmark results and resources to improve future research.
Contribution
The paper proposes a standardised human evaluation protocol for gesture generation on the BEAT2 dataset, conducts a large-scale crowdsourced benchmark of six recent models, and releases datasets and tools to facilitate consistent future assessments.
Findings
Motion realism has become a saturated evaluation measure on the BEAT2 dataset; older models perform as well as newer ones.
Claims of high speech-gesture alignment do not hold under rigorous evaluation.
Disentangled assessments of motion quality and alignment are necessary for accurate benchmarking.
Abstract
We review human evaluation practices in automatic, speech-driven 3D gesture generation and find a lack of standardisation and frequent use of flawed experimental setups. This leads to a situation where it is impossible to know how different methods compare, or what the state of the art is. In order to address common shortcomings of evaluation design, and to standardise future user studies in gesture-generation works, we introduce a detailed human evaluation protocol for the widely used BEAT2 motion-capture dataset. Using this protocol, we conduct large-scale crowdsourced evaluation to rank six recent gesture-generation models -- each trained by its original authors -- across two key evaluation dimensions: motion realism and speech-gesture alignment. Our results show that 1) motion realism has become a saturated evaluation measure on the BEAT2 dataset, with older models performing on par…
