EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation

Fathinah Izzati; Xinyue Li; Gus Xia

arXiv:2507.04955·cs.SD·July 8, 2025

TL;DR

Expotion is a multimodal music generation model that uses facial expressions, upper-body motion, and text prompts, employing parameter-efficient fine-tuning and temporal smoothing to produce synchronized, expressive music.

Contribution

The paper introduces a multimodal control framework for music generation that conditions on visual gestures and text prompts, together with a new dataset and a temporal alignment strategy.

Findings

01. Enhanced musicality and creativity in generated music

02. Improved temporal synchronization with video

03. Outperforms existing state-of-the-art models

Abstract

We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls, specifically human facial expressions and upper-body motion, as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on a pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and…
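
The abstract does not spell out how the temporal smoothing strategy works, so the following is only a rough sketch of the general idea of smoothing per-frame visual features and resampling them to the time resolution of a music model. It assumes a simple moving-average filter and linear interpolation, which may differ from what the paper actually uses. The function name `smooth_and_align`, the window size, the 30 fps video rate, and the 50 steps-per-second target rate are all hypothetical choices made for illustration.

```python
import numpy as np


def smooth_and_align(features, video_fps, target_rate, window=5):
    """Smooth per-frame features over time and resample them.

    Illustrative only: a moving average plus linear interpolation is an
    assumed stand-in, not the method described in the paper.

    features:    array of shape (num_frames, dim), one row per video frame
    video_fps:   frame rate of the video features
    target_rate: desired number of output steps per second
    window:      odd moving-average window length, in frames
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so the output length is preserved")

    features = np.asarray(features, dtype=float)
    num_frames, dim = features.shape

    # Moving average along the time axis; edge padding keeps num_frames rows.
    pad = window // 2
    padded = np.pad(features, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    smoothed = np.stack(
        [np.convolve(padded[:, d], kernel, mode="valid") for d in range(dim)],
        axis=1,
    )

    # Resample from video frame timestamps to the target step timestamps.
    duration = num_frames / video_fps
    num_steps = int(round(duration * target_rate))
    src_t = np.arange(num_frames) / video_fps
    dst_t = np.arange(num_steps) / target_rate
    aligned = np.stack(
        [np.interp(dst_t, src_t, smoothed[:, d]) for d in range(dim)],
        axis=1,
    )
    return aligned


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two seconds of hypothetical 8-dimensional features at 30 fps.
    video_features = rng.normal(size=(60, 8))
    aligned = smooth_and_align(video_features, video_fps=30, target_rate=50)
    print(aligned.shape)  # (100, 8)
```

In a setup like this, the smoothing step damps frame-to-frame jitter in the visual features, and the resampling step gives the conditioning signal one row per model time step, so that it can be combined with the music representation position by position.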
