How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them
Disen Liao, Freda Shi

TL;DR
This paper shows that current subword tokenization methods hinder the encoding of phonological knowledge in language models, and introduces an IPA-based fine-tuning approach that improves phonological representations with minimal impact on general reasoning.
Contribution
The paper identifies how tokenization affects phonological encoding and proposes an IPA-based fine-tuning method to enhance phonological awareness in language models.
Findings
Subword tokenization weakens phonological feature encoding.
Higher syllabification-tokenization alignment distance correlates with poorer phonological representations.
IPA-based fine-tuning improves phonology-related tasks with minimal impact on general reasoning.
Abstract
Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs' ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model's tokenization and the natural syllable boundaries of words, and find that higher misalignment correlates with poorer phonological representations, providing a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological…
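The abstract does not spell out how STAD is computed, so the following is only an illustrative sketch of the underlying idea: comparing where a tokenizer cuts a word with where its syllables break. It is not the paper's definition. The function names `internal_boundaries` and `boundary_misalignment` are hypothetical, and the score used here (Jaccard distance between the two sets of character-offset boundaries) is an assumption chosen for simplicity.

```python
from itertools import accumulate


def internal_boundaries(segments):
    """Return the character offsets at which a word is split.

    ["to", "ken"] -> {2}; a single-segment word has no internal boundary.
    """
    offsets = list(accumulate(len(s) for s in segments))
    return set(offsets[:-1])  # drop the final offset (end of word)


def boundary_misalignment(syllables, tokens):
    """Hypothetical misalignment score in [0, 1], NOT the paper's STAD.

    0.0 means the token boundaries coincide with the syllable boundaries;
    1.0 means the two segmentations share no boundary at all.
    """
    if "".join(syllables) != "".join(tokens):
        raise ValueError("both segmentations must spell the same word")
    syl = internal_boundaries(syllables)
    tok = internal_boundaries(tokens)
    union = syl | tok
    if not union:  # neither segmentation splits the word
        return 0.0
    return len(syl ^ tok) / len(union)  # Jaccard distance


# Assumed example segmentations of "tokenization"
score = boundary_misalignment(
    ["to", "ken", "i", "za", "tion"],  # syllables
    ["token", "ization"],              # subword tokens
)
print(score)  # 0.75
```

In this example the two segmentations share only the boundary after "token", while the syllable boundaries at offsets 2, 6, and 8 have no matching token boundary, giving 3 mismatches out of 4 distinct boundaries. Averaging such a score over a vocabulary would give a rough per-tokenizer diagnostic in the spirit of the metric described above.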
