EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation
Fathinah Izzati, Xinyue Li, Gus Xia

TL;DR
Expotion is a multimodal music generation model that uses facial expressions, upper-body motion, and text prompts, employing parameter-efficient fine-tuning and temporal smoothing to produce synchronized, expressive music.
Contribution
The paper introduces a novel multimodal control framework for music generation, combining visual gestures and text prompts with a new dataset and a temporal alignment strategy.
Findings
Enhanced musicality and creativity in generated music
Improved beat-tempo consistency and temporal alignment with the video
Outperforms the proposed baselines and existing state-of-the-art models
Abstract
We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls - specifically, human facial expressions and upper-body motion - as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and…
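The abstract names a temporal smoothing strategy for aligning video and music but does not spell it out here. The sketch below is only one plausible reading, not the authors' method: it assumes per-frame visual feature vectors are linearly resampled to the music model's frame rate and then smoothed with a centered moving average. The function names, the frame rates, and the window size are all hypothetical.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per time step


def resample_linear(features: Frames, target_len: int) -> Frames:
    """Linearly interpolate a feature sequence to target_len time steps."""
    n = len(features)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one frame and a positive target length")
    if n == 1:
        return [list(features[0]) for _ in range(target_len)]
    out: Frames = []
    for t in range(target_len):
        # Map output step t onto the source frame axis [0, n - 1].
        pos = t * (n - 1) / (target_len - 1) if target_len > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(features[lo], features[hi])])
    return out


def moving_average(features: Frames, window: int) -> Frames:
    """Centered moving average over time; the window shrinks at the edges."""
    half = window // 2
    n = len(features)
    dim = len(features[0])
    out: Frames = []
    for t in range(n):
        span = features[max(0, t - half):min(n, t + half + 1)]
        out.append([sum(v[d] for v in span) / len(span) for d in range(dim)])
    return out


def align_visual_to_music(features: Frames, video_fps: float,
                          music_fps: float, window: int = 3) -> Frames:
    """Resample video-rate features to the music frame rate, then smooth.

    Hypothetical helper: the rates and window are illustrative assumptions.
    """
    duration = len(features) / video_fps
    target_len = max(1, round(duration * music_fps))
    return moving_average(resample_linear(features, target_len), window)


# Four video frames at 2 fps (2 seconds) mapped to a 5 Hz music frame rate.
demo = align_visual_to_music([[0.0], [1.0], [2.0], [3.0]], video_fps=2, music_fps=5)
print(len(demo))
```

Smoothing after resampling keeps the control signal free of frame-to-frame jitter at the rate the music model consumes it; a learned or causal filter could take the place of the moving average in the same position.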
