From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Zébulon Goriely; Richard Diehl Martinez; Andrew Caines; Lisa Beinborn; Paula Buttery

arXiv:2410.22906·cs.CL·October 31, 2024

Open Access · 1 Repo · 9 Models · 2 Datasets

TL;DR

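This paper pre-trains language models on continuous phoneme streams instead of orthographic text. Performance on traditional language understanding tasks drops slightly, while the phonemic representation offers potential benefits for studying phonological acquisition and for sound-based tasks.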

Contribution

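The paper introduces a pipeline that converts text datasets into continuous phoneme streams, applies it to the BabyLM pre-training dataset and to standard language and grammatical benchmarks, and uses the converted data to pre-train and evaluate a model on phonemic input.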
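To make the idea of a text-to-phoneme conversion concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the lexicon `TOY_LEXICON`, the function `phonemize`, and the `WORD_BOUNDARY` token are all hypothetical names invented for illustration, and the three transcriptions are rough IPA-style approximations. The sketch also assumes that "continuous" means the output stream carries no word-boundary markers unless they are requested.

```python
# Hypothetical toy lexicon: a real system would use a pronunciation
# dictionary or a grapheme-to-phoneme tool, not three hard-coded words.
TOY_LEXICON = {
    "the": ["ð", "ə"],
    "cat": ["k", "æ", "t"],
    "sat": ["s", "æ", "t"],
}


def phonemize(text, lexicon=TOY_LEXICON, keep_word_boundaries=False,
              boundary="WORD_BOUNDARY"):
    """Convert orthographic text into a flat list of phoneme symbols.

    With keep_word_boundaries=False (assumed here to correspond to a
    "continuous stream"), phonemes of consecutive words are concatenated
    with nothing marking where one word ends and the next begins.
    """
    phonemes = []
    for raw_word in text.lower().split():
        # Drop surrounding punctuation, which has no phonemic counterpart.
        word = raw_word.strip(".,!?;:\"'")
        if not word:
            continue
        if word not in lexicon:
            raise KeyError(f"no transcription for {word!r} in the toy lexicon")
        phonemes.extend(lexicon[word])
        if keep_word_boundaries:
            phonemes.append(boundary)
    return phonemes


if __name__ == "__main__":
    print(phonemize("The cat sat."))
    print(phonemize("The cat sat.", keep_word_boundaries=True))
```

Applying the same conversion to both the pre-training corpus and the evaluation benchmarks is what allows a phoneme-trained model to be scored on tasks that were originally written in orthographic form.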

Findings

01 Phoneme-based training slightly reduces performance on traditional language understanding tasks.

02 Phoneme streams can offer deeper insights into phonological language acquisition.

03 Phoneme-based representations can improve performance on sound-based tasks.

Abstract

Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it…

Code & Models

Repositories

codebyzeb/Corpus-Phonemizer (Official)

Models

Datasets

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling