EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung

TL;DR
EMOVA is a novel omni-modal model that enables large language models to perceive, generate, and interact with images, speech, and emotions end-to-end, achieving state-of-the-art results in vision-language and speech tasks.
Contribution
The paper introduces EMOVA, a comprehensive omni-modal framework that integrates vision, speech, and emotion understanding in large language models, with novel modules for semantic-acoustic disentanglement and style control.
Findings
Achieves state-of-the-art on vision-language benchmarks.
Supports omni-modal spoken dialogue with vivid emotions.
Enhances vision-language and speech abilities through semantic-acoustic alignment.
Abstract
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech abilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly notice that omni-modal alignment can further enhance vision-language and speech abilities compared…
Code & Models
- 🤗 Quinn777/AtomThink-EMOVA-8B · model · 9 dl · ♡ 1
- 🤗 Emova-ollm/qwen2vit600m · model · 481 dl
- 🤗 Emova-ollm/emova_speech_tokenizer_hf · model · 132 dl · ♡ 2
- 🤗 Emova-ollm/deepseek-vl2-deepseekmoe-tiny · model · 4 dl
- 🤗 Emova-ollm/deepseek-vl2-deepseekmoe-tiny_add_speech_token_4096_nostrip · model · 9 dl
- 🤗 Emova-ollm/Qwen2.5-3B-Instruct_add_speech_token_4096_nostrip · model · 7 dl
- 🤗 Emova-ollm/Qwen2.5-7B-Instruct_add_speech_token_4096_nostrip · model · 3 dl
- 🤗 Emova-ollm/Meta-Llama-3.1-8B-Instruct_add_speech_token_4096_nostrip-2 · model · 5 dl
- 🤗 Emova-ollm/emova-qwen-2-5-7b-hf · model · 73 dl · ♡ 2
- 🤗 Emova-ollm/emova-qwen-2-5-3b-hf · model · 6 dl · ♡ 5
Taxonomy
Topics: Speech and dialogue systems
