Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation

Zhilin Gao; Yunhao Li; Sijing Wu; Yuqin Cao; Huiyu Duan; Guangtao Zhai

arXiv:2508.12020 · cs.MM · March 27, 2026

Open Access

TL;DR

This paper introduces Ges-QA, a comprehensive dataset and a multi-modal transformer model for evaluating the quality of AI-generated 3D gestures from audio, addressing limitations of existing metrics by aligning with human preferences.

Contribution

The paper presents the first multidimensional quality assessment dataset for audio-to-3D gesture generation and a novel transformer-based model for multidimensional evaluation.

Findings

01. Ges-QA dataset contains 1,400 samples with multidimensional scores.

02. The proposed Ges-QAer model achieves state-of-the-art performance.

03. Multi-modal approach effectively assesses gesture quality and emotion matching.

Abstract

The Audio-to-3D-Gesture (A2G) task has enormous potential for applications in virtual reality, computer graphics, and related fields. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail to reflect human preferences for the generated 3D gestures. To address this problem, it is increasingly important to study human preference and to develop an objective quality assessment metric for AI-generated 3D human gestures. In this paper, we introduce the Ges-QA dataset, which includes 1,400 samples with multidimensional scores for gesture quality and audio-gesture consistency. Moreover, we collect binary classification labels indicating whether the generated gestures match the emotions of the audio. Equipped with our Ges-QA dataset, we propose a multi-modal transformer-based neural network with 3 branches for video, audio and 3D skeleton modalities,…
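
To make the annotation scheme described in the abstract concrete, here is a minimal Python sketch of what one annotated sample might look like: two human scores (gesture quality and audio-gesture consistency) plus a binary emotion-match label. The class name, field names, score scale, and the summarize helper are all assumptions made for illustration; the abstract does not specify the dataset's actual file format or schema, and the values in the usage lines are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class GestureSample:
    """One annotated sample. All field names here are hypothetical."""
    sample_id: str
    gesture_quality: float    # human score for gesture quality (scale assumed)
    audio_consistency: float  # human score for audio-gesture consistency (scale assumed)
    emotion_match: bool       # binary label: do the gestures match the audio's emotion?


def summarize(samples):
    """Return the mean score per dimension and the fraction of emotion-matched samples."""
    if not samples:
        raise ValueError("need at least one sample")
    return {
        "gesture_quality": mean(s.gesture_quality for s in samples),
        "audio_consistency": mean(s.audio_consistency for s in samples),
        "emotion_match_rate": sum(s.emotion_match for s in samples) / len(samples),
    }


# Illustrative usage with made-up values (not taken from Ges-QA).
demo = [
    GestureSample("a", 4.0, 3.0, True),
    GestureSample("b", 2.0, 5.0, False),
]
print(summarize(demo))
```

Keeping the two score dimensions and the emotion label as separate fields mirrors the abstract's distinction between multidimensional scores and binary classification labels, so each can be aggregated or predicted independently.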

Taxonomy

Topics: Human Motion and Animation · Music and Audio Processing · Hand Gesture Recognition Systems