Semantics-Aware Human Motion Generation from Audio Instructions

Zi-An Wang; Shihao Zou; Shiyao Yu; Mingyuan Zhang; Chao Dong

arXiv:2505.23465·cs.SD·May 30, 2025

TL;DR

This paper introduces a framework that uses audio instructions as semantic conditioning for human motion generation. Audio offers a more natural interface than text, and conditioning on its meaning addresses the weak semantic link left by methods that only match motion to music or speech rhythm.

Contribution

The paper presents an end-to-end masked generative transformer with a memory-retrieval attention module for audio-conditioned motion generation, and enriches existing datasets with conversational-style descriptions and corresponding audio from varied speaker identities.

Findings

01 Effective semantic alignment between audio instructions and generated motions

02 Enhanced dataset with conversational style descriptions and diverse speaker audio

03 Demonstrated efficiency and practicality of audio-based motion generation

Abstract

Recent advances in interactive technologies have highlighted the prominence of audio signals for semantic encoding. This paper explores a new task, where audio signals are used as conditioning inputs to generate motions that align with the semantics of the audio. Unlike text-based interactions, audio provides a more natural and intuitive communication method. However, existing methods typically focus on matching motions with music or speech rhythms, which often results in a weak connection between the semantics of the audio and generated motions. We propose an end-to-end framework using a masked generative transformer, enhanced by a memory-retrieval attention module to handle sparse and lengthy audio inputs. Additionally, we enrich existing datasets by converting descriptions into conversational style and generating corresponding audio with varied speaker identities. Experiments…
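
The abstract names a masked generative transformer but does not spell out its decoding procedure. As a rough illustration only, the sketch below shows the generic confidence-based iterative unmasking commonly used with masked generative models: start from fully masked motion tokens and reveal the most confident predictions over a few steps. Everything here is an assumption for illustration, not the authors' implementation: `MASK`, `toy_predictor`, `iterative_unmask`, the cosine schedule, and the vocabulary size are hypothetical, and the stub predictor stands in for a real transformer (including the memory-retrieval attention over audio features, which is not modeled).

```python
import math
import random

MASK = -1  # hypothetical id for a masked motion token


def toy_predictor(tokens, audio_feats, vocab_size, rng):
    """Stand-in for the transformer (assumption, not the paper's model).

    Returns one (token, confidence) pair per position. A real model would
    attend to the audio features; this stub only uses their sum as an offset
    and draws random confidences.
    """
    offset = int(sum(audio_feats))
    return [((i + offset) % vocab_size, rng.random()) for i in range(len(tokens))]


def iterative_unmask(length, audio_feats, vocab_size=8, steps=4, seed=0):
    """Fill `length` motion tokens by revealing confident predictions step by step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_predictor(tokens, audio_feats, vocab_size, rng)
        # Cosine schedule: how many tokens should still be masked after this step.
        keep_masked = math.floor(length * math.cos(math.pi / 2 * step / steps))
        n_reveal = max(1, len(masked) - keep_masked)
        # Reveal the masked positions with the highest predicted confidence.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_reveal]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    # Two made-up audio feature values; a real system would use encoder outputs.
    print(iterative_unmask(10, [1.0, 1.0]))
```

In a real system the resulting token ids would be passed to a motion decoder (for example a VQ-style decoder) to obtain poses; that stage is omitted here.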
