Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units
Anurag Katakkar, Alan W Black

TL;DR
This paper introduces a novel LSTM-based generative speech language model built on sub-word linguistic units such as syllables and phonemes. It reports promising results with limited data and examines the challenges of training such models.
Contribution
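The paper presents a new speech language model built on linguistic units, which offer better acoustic consistency across utterances than single mel-spectrogram frames or whole words, and it examines the challenges of training generative LMs in the speech domain.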
Findings
Model closely approximates babbling speech with limited data
Training with auxiliary text LMs and articulatory features impacts performance
Validation metrics like MCD may not correlate with speech quality
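For readers unfamiliar with the metric named in the last finding, MCD (mel-cepstral distortion) is conventionally computed per frame as (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2) and averaged over frames. The sketch below follows that standard definition. It is not the authors' evaluation code: the function name is hypothetical, and it assumes the two sequences are already time-aligned and that the 0th (energy) coefficient has been dropped.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB between two time-aligned sequences of mel-cepstral vectors.

    Assumption: frames are already aligned one-to-one (e.g. by DTW done
    elsewhere) and each vector excludes the 0th (energy) coefficient.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("sequences must be non-empty and of equal length")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        squared_diff = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)


if __name__ == "__main__":
    reference = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
    synthesized = [[1.1, 0.4, -0.2], [0.8, 0.5, 0.0]]
    print(f"MCD: {mel_cepstral_distortion(reference, synthesized):.3f} dB")
```

Because the score is a frame-wise Euclidean distance in cepstral space, a lower value need not mean better perceived quality, which is consistent with the finding above.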
Abstract
Language models (LMs) for text data have been studied extensively for their usefulness in language generation and other downstream tasks. However, language modelling purely in the speech domain is still a relatively unexplored topic, with traditional speech LMs often depending on auxiliary text LMs for learning distributional aspects of the language. For the English language, these LMs treat words as atomic units, which presents inherent challenges to language modelling in the speech domain. In this paper, we propose a novel LSTM-based generative speech LM that is inspired by the CBOW model and built on linguistic units including syllables and phonemes. This offers better acoustic consistency across utterances in the dataset, as opposed to single melspectrogram frames, or whole words. With a limited dataset, orders of magnitude smaller than that required by contemporary generative…
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
