How Tokenization Limits Phonological Knowledge Representation in Language Models and How to Improve Them

Disen Liao; Freda Shi

arXiv:2604.17105 · cs.CL · April 21, 2026

TL;DR

Subword tokenization weakens how text-only language models encode phonological knowledge. The paper quantifies this with a metric of misalignment between token and syllable boundaries, and introduces a lightweight IPA-based fine-tuning approach that improves phonological representations with minimal impact on general reasoning.

Contribution

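The paper shows how subword tokenization affects phonological encoding in text-only language models, introduces the syllabification-tokenization alignment distance (STAD) as a diagnostic, and proposes a lightweight IPA-based fine-tuning method to strengthen phonological knowledge.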

Findings

01. Subword tokenization weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features.

02. Higher syllabification-tokenization alignment distance (STAD) correlates with poorer phonological representations.

03. IPA-based fine-tuning improves performance on phonology-related tasks with minimal impact on general reasoning.

Abstract

Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs' ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model's tokenization and the natural syllable boundaries of words, and find that higher misalignment correlates with poorer phonological representations, providing a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological…
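
To make the idea of token-syllable misalignment concrete, here is a minimal sketch of one possible boundary-based score. It is not the paper's definition of STAD, which the abstract does not spell out. The function names `interior_boundaries` and `boundary_mismatch` are hypothetical, and the choice of a Jaccard-style distance over character-offset boundaries is an assumption made purely for illustration.

```python
from typing import List, Set


def interior_boundaries(segments: List[str]) -> Set[int]:
    """Return the character offsets where one segment ends and the next begins."""
    offsets: Set[int] = set()
    position = 0
    for segment in segments[:-1]:  # the end of the last segment is not a cut
        position += len(segment)
        offsets.add(position)
    return offsets


def boundary_mismatch(tokens: List[str], syllables: List[str]) -> float:
    """Illustrative misalignment score in [0, 1]; NOT the paper's STAD formula.

    Assumption: misalignment is the Jaccard distance between the set of
    token boundaries and the set of syllable boundaries of the same word.
    0.0 means the cuts coincide exactly; 1.0 means no cut is shared.
    """
    if "".join(tokens) != "".join(syllables):
        raise ValueError("tokens and syllables must spell the same word")
    token_cuts = interior_boundaries(tokens)
    syllable_cuts = interior_boundaries(syllables)
    all_cuts = token_cuts | syllable_cuts
    if not all_cuts:
        return 0.0  # a single token and a single syllable agree trivially
    return len(token_cuts ^ syllable_cuts) / len(all_cuts)


# Hypothetical segmentations, chosen by hand for illustration only.
aligned = boundary_mismatch(["to", "ken"], ["to", "ken"])
shifted = boundary_mismatch(["tok", "en"], ["to", "ken"])
partial = boundary_mismatch(["pho", "no", "logy"], ["pho", "nol", "o", "gy"])
```

Under this assumed scoring, a tokenizer that splits words at syllable boundaries scores 0, while one that cuts mid-syllable scores closer to 1. Averaging such a score over a vocabulary sample would give one rough per-tokenizer diagnostic in the spirit of the metric described above.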