Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue
Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han, Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

TL;DR
This paper introduces ParalinGPT, a multimodal LLM that incorporates speech and paralinguistic cues to enhance naturalness and emotional understanding in spoken dialogue, outperforming traditional models in sentiment and response quality.
Contribution
The paper presents a novel serialized multitasking framework for multimodal LLMs that integrates speech and paralinguistic features for improved dialogue modeling.
Findings
Outperforms sequence classification in sentiment prediction
Improves response text quality with speech embeddings
Achieves significant accuracy and BLEU score improvements
Abstract
Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken dialogue. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multimodal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques
MethodsAttention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dropout · Layer Normalization · Residual Connection · Byte Pair Encoding
