EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen; Yunhao Gou; Runhui Huang; Zhili Liu; Daxin Tan; Jing Xu; Chunwei Wang; Yi Zhu; Yihan Zeng; Kuo Yang; Dingdong Wang; Kun Xiang; Haoyuan Li; Haoli Bai; Jianhua Han; Xiaohui Li; Weike Jin; Nian Xie; Yu Zhang; James T. Kwok; Hengshuang Zhao; Xiaodan Liang; Dit-Yan Yeung; Xiao Chen; Zhenguo Li; Wei Zhang; Qun Liu; Jun Yao; Lanqing Hong; Lu Hou; Hang Xu

arXiv:2409.18042·cs.CV·March 21, 2025

Open Access · 10 Models · 5 Datasets

TL;DR

EMOVA is a novel omni-modal model that enables large language models to perceive, generate, and interact with images, speech, and emotions end-to-end, achieving state-of-the-art results in vision-language and speech tasks.

Contribution

The paper introduces EMOVA, a comprehensive omni-modal framework that integrates vision, speech, and emotion understanding in large language models, with novel modules for semantic-acoustic disentanglement and style control.

Findings

01 Achieves state-of-the-art results on vision-language benchmarks.

02 Supports omni-modal spoken dialogue with vivid emotions.

03 Enhances vision-language and speech abilities through semantic-acoustic alignment.

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech abilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly notice that omni-modal alignment can further enhance vision-language and speech abilities compared…

Taxonomy

Topics: Speech and dialogue systems