TL;DR
This study reveals that current spoken language models struggle to maintain specified speaking styles over multiple turns, but explicit recall can mitigate this issue, highlighting a gap in style consistency.
Contribution
The paper systematically investigates style amnesia in spoken language models and demonstrates that explicit recall improves style maintenance, revealing a key limitation in current models.
Findings
Models cannot maintain speaking styles over multiple turns.
Explicit recall helps mitigate style amnesia.
Style instructions in system messages are less effective.
Abstract
In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that while SLMs can recall the style instruction when prompted in later turns, they still fail to express it, but through explicit recall can mitigate style amnesia. In addition, SLMs struggle more when the style instruction is placed in system messages rather than user messages, even though system messages are specifically…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
