The Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR

Siyu Liang; Nicolas Ballier; Gina-Anne Levow; Richard Wright

arXiv:2510.22492·cs.CL·October 28, 2025

TL;DR

This paper investigates how a multilingual ASR model (Whisper) utilizes sub-tokens during inference across 49 languages, revealing that sub-token discovery saturates quickly regardless of data disparity, and that linguistic features influence token usage more than training data volume.

Contribution

The study provides an empirical analysis of sub-token utilization in multilingual ASR, introducing the concept of acoustic saturation time and highlighting the influence of linguistic features on token discovery.

Findings

01 Sub-token discovery follows a consistent exponential saturation pattern across languages.

02 A language's pre-training hours do not strongly affect the lexical diversity of the model's hypothesis space.

03 Linguistic and orthographic features influence token utilization patterns.

Abstract

How much audio is needed to fully observe a multilingual ASR model's learned sub-token inventory across languages, and does data disparity in multilingual pre-training affect how these tokens are utilized during inference? We address this question by analyzing Whisper's decoding behavior during inference across 49 languages. By logging decoding candidate sub-tokens and tracking their cumulative discovery over time, we study the utilization pattern of the model's sub-token space. Results show that the total number of discovered tokens remains largely independent of a language's pre-training hours, indicating that data disparity does not strongly influence lexical diversity in the model's hypothesis space. Sub-token discovery rates follow a consistent exponential saturation pattern across languages, suggesting a stable time window after which additional audio yields minimal new sub-token…
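The idea of tracking cumulative sub-token discovery and summarizing it with an exponential saturation curve can be illustrated with a minimal sketch. The listing above does not state the functional form or the fitting procedure the authors used, so everything here is an assumption made for illustration: the model N(t) = n_max * (1 - exp(-t / tau)), the grid-search fit, the 95% threshold used to define a saturation time, and all function names (`cumulative_discovery`, `fit_saturation`, `saturation_time`). The data in the demo is synthetic, not taken from the paper.

```python
import math


def cumulative_discovery(token_stream):
    """Return the running count of distinct token ids after each decoded token."""
    seen = set()
    counts = []
    for tok in token_stream:
        seen.add(tok)
        counts.append(len(seen))
    return counts


def fit_saturation(times, counts, tau_grid):
    """Least-squares fit of the ASSUMED model N(t) = n_max * (1 - exp(-t / tau)).

    For each candidate tau the optimal n_max has a closed form, so a simple
    grid search over tau is enough for a sketch.
    """
    best = None
    for tau in tau_grid:
        f = [1.0 - math.exp(-t / tau) for t in times]
        denom = sum(v * v for v in f)
        if denom == 0.0:
            continue
        n_max = sum(y * v for y, v in zip(counts, f)) / denom
        sse = sum((y - n_max * v) ** 2 for y, v in zip(counts, f))
        if best is None or sse < best[0]:
            best = (sse, n_max, tau)
    if best is None:
        raise ValueError("no usable tau in tau_grid")
    _, n_max, tau = best
    return n_max, tau


def saturation_time(tau, fraction=0.95):
    """Time at which the assumed curve reaches `fraction` of n_max.

    The 95% default is an illustrative choice, not the paper's definition.
    """
    return -tau * math.log(1.0 - fraction)


# Synthetic demo: a curve with n_max = 1000 tokens and tau = 60 seconds.
demo_times = [float(t) for t in range(0, 601, 10)]
demo_counts = [1000.0 * (1.0 - math.exp(-t / 60.0)) for t in demo_times]
demo_grid = [float(k) for k in range(5, 201)]

fitted_n_max, fitted_tau = fit_saturation(demo_times, demo_counts, demo_grid)
demo_saturation = saturation_time(fitted_tau)
```

Under this assumed form, the saturation time depends only on tau, so two languages with very different fitted n_max values can still share the same time window, which is one way to read the claim that a stable window exists across languages.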
