Small Language Models Also Work With Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas
Bastian Bunzeck, Daniel Duran, Leonie Schade, Sina Zarrieß

TL;DR
This paper shows that small, tokenization-free language models based on phonemes or graphemes can effectively learn linguistic structures, challenging the reliance on subword tokenization and offering more developmentally plausible models.
Contribution
It introduces and evaluates small, tokenization-free language models using phoneme and grapheme vocabularies, demonstrating their strong linguistic capabilities.
Findings
Small phoneme- and grapheme-based models perform well on standard syntactic benchmarks and on novel lexical/phonetic benchmarks.
Phoneme-based models nearly match grapheme-based models on both the standard tasks and the novel evaluations.
Tokenization-free models offer a more linguistically plausible approach for language modeling.
Abstract
Recent work investigates whether LMs learn human-like linguistic generalizations and representations from developmentally plausible amounts of data. Yet, the basic linguistic units processed in these LMs are determined by subword-based tokenization, which limits their validity as models of learning at and below the word level. In this paper, we explore the potential of tokenization-free, phoneme- and grapheme-based language models. We demonstrate that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks when trained with character-level vocabularies. We further show that phoneme-based models almost match grapheme-based models in standard tasks and novel evaluations. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for…
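To make the idea of a character-level vocabulary concrete, here is a minimal sketch of grapheme-level encoding. It is an illustration only, not the authors' implementation: the function names (`build_vocab`, `encode`, `decode`), the special symbols `<pad>` and `<unk>`, and the toy corpus are all assumptions made for this example. A phoneme-based variant would work the same way if each input symbol were a single phoneme character from a transcription.

```python
# Minimal sketch of a character-level (grapheme) vocabulary.
# All names and the toy corpus are hypothetical, not taken from the paper.

def build_vocab(texts, specials=("<pad>", "<unk>")):
    """Map every distinct character in the corpus to an integer id."""
    symbols = sorted({ch for text in texts for ch in text})
    return {sym: i for i, sym in enumerate(list(specials) + symbols)}

def encode(text, vocab):
    """Turn a string into a list of ids, one per character."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]

def decode(ids, vocab):
    """Invert encode() for ids that correspond to known characters."""
    inverse = {i: sym for sym, i in vocab.items()}
    return "".join(inverse[i] for i in ids)

corpus = ["the cat sat", "a dog ran"]
vocab = build_vocab(corpus)
ids = encode("the dog", vocab)
print(len(vocab), ids, decode(ids, vocab))
```

The vocabulary here has one entry per distinct character plus the special symbols, which is far smaller than a typical subword vocabulary; the trade-off is that each sentence becomes a longer sequence of ids.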
Taxonomy
Topics: Language Development and Disorders · Speech and dialogue systems
Methods: LLaMA
