Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

Shihao Cheng; Jiaxu Zhang; Quanyue Song; Shansong Liu; Zhizhi Guo; Xiaolei Zhang; Chi Zhang; Xuelong Li; Zhigang Tu

arXiv:2605.08729 · cs.CV · May 12, 2026

TL;DR

Unison is a unified framework that improves the coherence and synchronization of motion, speech, and sound in human-centric videos by explicitly modeling their interactions.

Contribution

Unison introduces semantic-guided harmonization and bidirectional cross-modal forcing strategies to improve multimodal alignment.

Findings

1. Achieves state-of-the-art audio perceptual quality.
2. Improves cross-modal synchronization accuracy.
3. Effectively mitigates speech dominance and environmental noise.

Abstract

Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a…
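To make the semantic-guided harmonization idea in the abstract more concrete, here is a minimal sketch of bidirectional cross-attention between a speech stream and a sound-effect stream, followed by a gate conditioned on a semantic vector that recombines them. This is an illustration under stated assumptions, not the authors' implementation: the function names (`cross_attend`, `harmonize`), the single-head attention without learned projections, the residual update, the per-channel sigmoid gate, and the requirement that both streams have the same number of frames are all hypothetical choices made for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned query/key/value projections, so the context
    rows serve as both keys and values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, context))
                    for j in range(d)])
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def harmonize(speech, sfx, semantic, gate_weights):
    """Toy recomposition of two audio streams (hypothetical design).

    speech, sfx:   lists of T frames, each a list of d features
    semantic:      list of s floats describing the scene
    gate_weights:  d rows of s floats mapping semantics to a channel gate
    """
    # Bidirectional exchange: each stream queries the other.
    speech_ctx = cross_attend(speech, sfx)
    sfx_ctx = cross_attend(sfx, speech)

    # Residual update keeps each stream's own content.
    speech_h = [[a + b for a, b in zip(r, c)]
                for r, c in zip(speech, speech_ctx)]
    sfx_h = [[a + b for a, b in zip(r, c)]
             for r, c in zip(sfx, sfx_ctx)]

    # Semantic-conditioned gate, one value in (0, 1) per channel.
    gate = [sigmoid(sum(w * s for w, s in zip(row, semantic)))
            for row in gate_weights]

    # Convex mix per channel: gate near 1 favors speech, near 0 favors effects.
    mixed = [[g * a + (1.0 - g) * b for g, a, b in zip(gate, sr, xr)]
             for sr, xr in zip(speech_h, sfx_h)]
    return mixed, gate


if __name__ == "__main__":
    speech = [[1.0, 0.0], [0.0, 1.0]]
    sfx = [[0.5, 0.5], [0.2, 0.8]]
    semantic = [1.0, -1.0]
    gate_weights = [[2.0, 0.0], [0.0, 2.0]]
    mixed, gate = harmonize(speech, sfx, semantic, gate_weights)
    print("gate:", gate)
    print("mixed:", mixed)
```

In this toy setup, a gate that is free to fall below 0.5 on some channels is one simple way a model could keep speech from dominating the mix; how Unison actually parameterizes and trains its gating is not specified in the truncated abstract.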
