GOAT-SLM: A Spoken Language Model with Paralinguistic and Speaker Characteristic Awareness

Hongjie Chen; Zehan Li; Yaodong Song; Wenming Deng; Yitong Yao; Yuxin Zhang; Hang Lv; Xuechao Zhu; Jian Kang; Jie Lian; Jie Li; Chao Wang; Shuangyong Song; Yongxiang Li; Zhongjiang He; Xuelong Li

arXiv:2507.18119 · cs.CL · July 28, 2025

Open Access

TL;DR

GOAT-SLM is a spoken language model that attends to paralinguistic cues and speaker characteristics (such as dialect, age, emotion, and non-speech vocalizations) in addition to linguistic content, with the aim of more natural and socially aware spoken interaction.

Contribution

It introduces a dual-modality head architecture that decouples linguistic modeling from acoustic realization, together with a modular, staged training strategy for integrating linguistic, paralinguistic, and speaker information in spoken language modeling.

Findings

1. Outperforms existing models on emotion-, dialect-, and age-sensitive tasks.

2. Achieves balanced performance across semantic and non-semantic tasks.

3. Demonstrates robustness in handling diverse speaker characteristics.

Abstract

Recent advances in end-to-end spoken language models (SLMs) have significantly improved the ability of AI systems to engage in natural spoken interactions. However, most existing models treat speech merely as a vehicle for linguistic content, often overlooking the rich paralinguistic and speaker characteristic cues embedded in human speech, such as dialect, age, emotion, and non-speech vocalizations. In this work, we introduce GOAT-SLM, a novel spoken language model with paralinguistic and speaker characteristic awareness, designed to extend spoken language modeling beyond text semantics. GOAT-SLM adopts a dual-modality head architecture that decouples linguistic modeling from acoustic realization, enabling robust language understanding while supporting expressive and adaptive speech generation. To enhance model efficiency and versatility, we propose a modular, staged training strategy…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling