MOSPA: Human Motion Generation Driven by Spatial Audio

Shuyang Xu; Zhiyang Dou; Mingyi Shi; Liang Pan; Leo Ho; Jingbo Wang; Yuan Liu; Cheng Lin; Yuexin Ma; Wenping Wang; Taku Komura

arXiv:2507.11949 · cs.GR · November 4, 2025

TL;DR

This paper introduces MOSPA, a diffusion-based framework that generates realistic human motions driven by spatial audio, supported by a new comprehensive dataset, advancing the modeling of human responses to auditory stimuli.

Contribution

The paper presents the first spatial audio-driven human motion dataset and a diffusion-based model for realistic motion generation conditioned on spatial audio inputs.

Findings

01. MOSPA achieves state-of-the-art performance on spatial audio-driven motion generation.

02. The dataset enables diverse and high-quality motion-audio research.

03. The model effectively captures the relationship between spatial audio cues and human motion.

Abstract

Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most prior work has focused on generating human motion from modalities such as speech, audio, and music. To date, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation…

Code & Models

Models

🤗 JimSYXu/MOSPA · model · ♡ 1

Taxonomy

Topics: Music Technology and Sound Studies · Human Motion and Animation · Music and Audio Processing