Hello-Chat: Towards Realistic Social Audio Interactions
Yueran Hou, Peilei Jia, Zihan Sun, Qihang Lu, Wenbing Yang, Yingming Gao, Ya Li, Jun Gao

TL;DR
Hello-Chat is an end-to-end audio language model for realistic social scenarios. It improves the naturalness, emotional resonance, and human-likeness of generated speech by training on a large dataset of real-life conversations with a modality-interleaved training strategy.
Contribution
The paper introduces Hello-Chat, an end-to-end audio language model that makes social audio interactions more realistic by combining a large dataset of real-life conversations with a modality-interleaved training strategy, yielding more natural and emotionally aligned responses.
Findings
Achieves state-of-the-art performance on specific audio understanding tasks.
Significantly outperforms existing baselines in prosodic naturalness.
Significantly outperforms existing baselines in emotional alignment.
Abstract
Recent advancements in Large Audio Language Models (LALMs) have demonstrated exceptional performance in speech recognition and translation. However, existing models often suffer from a disconnect between perception and expression, resulting in a robotic "read-speech" style that lacks the spontaneity and emotional resonance of real human interaction. In this report, we introduce Hello-Chat, an end-to-end audio language model designed for realistic social scenarios. By leveraging a massive dataset of real-life conversations and employing a modality-interleaved training strategy, Hello-Chat achieves a breakthrough in anthropomorphic generation. Experimental results show that our model not only reaches state-of-the-art (SOTA) performance on specific audio understanding tasks but also significantly outperforms existing baselines in prosodic naturalness and emotional alignment, paving the way…
Taxonomy
Topics: Speech Recognition and Synthesis · AI in Service Interactions · Emotion and Mood Recognition
