Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach
Maxime Poli, Emmanuel Chemla, Emmanuel Dupoux

TL;DR
This paper shows that fine-tuning speech representation models on phoneme classification yields more context-invariant units, and that language models trained on these units reach lexical comprehension comparable to models trained on a hundred times more data.
Contribution
The study introduces a simple approach, fine-tuning speech representation models on phoneme classification, that makes their representations more context-invariant and better suited as units for spoken language modeling.
Findings
Phoneme fine-tuning yields more context-invariant speech representations.
Language models trained on units from the fine-tuned models achieve lexical comprehension comparable to models trained on a hundred times more data.
Fine-tuned models outperform baseline speech models in understanding tasks.
Abstract
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech opens up the path to more natural and expressive systems. On the other hand, speech-only systems require up to three orders of magnitude more data to catch up to their text-based counterparts in terms of their semantic abilities. We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations, and language models trained on these units achieve comparable lexical comprehension to ones trained on a hundred times more data.
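The abstract does not spell out the training objective, so the following is only a minimal sketch of what a phoneme classification objective can look like, under stated assumptions: each speech frame has one phoneme label, the loss is frame-wise cross-entropy, and only a linear softmax head is trained on fixed feature vectors (a stand-in for fine-tuning a full speech encoder). The toy 2-D "frames", the three phoneme classes, and every function name here are hypothetical and not taken from the paper.

```python
import math
import random


def make_toy_frames(n_per_class, centers, noise, rng):
    """Hypothetical stand-in for encoder outputs: noisy points around one center per phoneme."""
    frames, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            frames.append([rng.gauss(cx, noise), rng.gauss(cy, noise)])
            labels.append(label)
    order = list(range(len(frames)))
    rng.shuffle(order)
    return [frames[i] for i in order], [labels[i] for i in order]


def logits_for(W, b, feat):
    """Linear head: one score per phoneme class."""
    return [sum(w * x for w, x in zip(row, feat)) + bk for row, bk in zip(W, b)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_phoneme_head(frames, labels, n_phonemes, lr=0.1, epochs=50):
    """Frame-wise cross-entropy training of the linear head with plain SGD."""
    dim = len(frames[0])
    W = [[0.0] * dim for _ in range(n_phonemes)]
    b = [0.0] * n_phonemes
    history = []  # mean loss per epoch
    for _ in range(epochs):
        total = 0.0
        for feat, label in zip(frames, labels):
            probs = softmax(logits_for(W, b, feat))
            total += -math.log(probs[label])
            for k, p in enumerate(probs):
                grad = p - (1.0 if k == label else 0.0)  # d(loss)/d(logit_k)
                b[k] -= lr * grad
                for j, x in enumerate(feat):
                    W[k][j] -= lr * grad * x
        history.append(total / len(frames))
    return W, b, history


def predict(W, b, feat):
    scores = logits_for(W, b, feat)
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
centers = [(2.0, 0.0), (0.0, 2.0), (-2.0, -2.0)]  # three made-up phoneme classes
frames, labels = make_toy_frames(20, centers, 0.3, rng)
W, b, history = train_phoneme_head(frames, labels, n_phonemes=3)
accuracy = sum(predict(W, b, f) == y for f, y in zip(frames, labels)) / len(frames)
print(f"mean loss: first epoch {history[0]:.3f}, last epoch {history[-1]:.3f}")
print(f"frame accuracy: {accuracy:.2f}")
```

In an actual fine-tuning setup the gradient would also flow into the speech encoder rather than stopping at the head; the sketch keeps the features fixed only to stay self-contained.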
Taxonomy
Topics: Speech Recognition and Synthesis
