From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes
Zébulon Goriely, Richard Diehl Martinez, Andrew Caines, Lisa Beinborn, Paula Buttery

TL;DR
This paper explores pre-training language models on continuous phoneme streams instead of text, revealing potential benefits for phonological understanding despite slight performance trade-offs.
Contribution
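It introduces a pipeline that converts text datasets into continuous phoneme streams, applies it to the 100-million-word BabyLM pre-training dataset and to standard language and grammatical benchmarks, and uses the converted data to pre-train and evaluate a model on phonemic input.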
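As a rough illustration of what converting text into a phoneme stream involves, here is a minimal Python sketch. It is not the authors' pipeline: the toy lexicon, the `<WB>` boundary marker, and the function name `text_to_phonemes` are all hypothetical, and a real system would use a full grapheme-to-phoneme tool rather than a three-word dictionary. The `keep_word_boundaries` flag reflects one reading of the SPACELESS model names listed under Code & Models, which is an assumption.

```python
import re

# Hypothetical toy lexicon with IPA-style transcriptions, for illustration only.
TOY_LEXICON = {
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
}

# Hypothetical word-boundary marker; the real pipeline may use a different symbol.
WORD_BOUNDARY = "<WB>"


def text_to_phonemes(text, keep_word_boundaries=True):
    """Convert orthographic text into a flat list of phoneme symbols.

    Punctuation and casing are discarded. If keep_word_boundaries is False,
    the output is one continuous stream with no marker between words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    stream = []
    for word in words:
        if word not in TOY_LEXICON:
            raise KeyError(f"no transcription for {word!r} in the toy lexicon")
        stream.extend(TOY_LEXICON[word])
        if keep_word_boundaries:
            stream.append(WORD_BOUNDARY)
    return stream


if __name__ == "__main__":
    print(" ".join(text_to_phonemes("The cat sat.")))
    print(" ".join(text_to_phonemes("The cat sat.", keep_word_boundaries=False)))
```

With boundaries removed, the model sees only the phoneme sequence and has to discover word-like units on its own, which is what makes the continuous-stream setting harder than ordinary text.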
Findings
- Phoneme-based training slightly reduces performance on traditional language understanding tasks.
- Phoneme representations can offer deeper insights into phonological language acquisition.
- Phoneme representations can also improve performance on sound-based tasks.
Abstract
Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it…
Code & Models
- 🤗 phonemetransformers/GPT2-85M-BPE-TXT · model · 100 dl · ♡ 1
- 🤗 phonemetransformers/GPT2-85M-BPE-PHON · model · 85 dl
- 🤗 phonemetransformers/GPT2-85M-BPE-TXT-SPACELESS · model · 22 dl
- 🤗 phonemetransformers/GPT2-85M-BPE-PHON-SPACELESS · model · 10 dl
- 🤗 phonemetransformers/GPT2-85M-CHAR-TXT · model · 15 dl
- 🤗 phonemetransformers/GPT2-85M-CHAR-TXT-SPACELESS · model · 16 dl
- 🤗 phonemetransformers/GPT2-85M-CHAR-PHON-SPACELESS · model · 18 dl
- 🤗 phonemetransformers/GPT2-85M-CHAR-PHON · model · 95 dl
- 🤗 phonemetransformers/babble-tokenizers · model
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
