TL;DR
This paper introduces PhoneticXEUS, a multilingual phone recognition model trained on large-scale multilingual data. It achieves state-of-the-art results on both multilingual and accented English speech, and the paper analyzes which factors drive performance across languages and accents.
Contribution
The paper presents a new training recipe for multilingual phone recognition (PR), evaluates the impact of data scale, self-supervised learning (SSL) representations, and loss objectives, and analyzes error patterns across diverse speech conditions.
Findings
Achieved 17.7% PFER on multilingual speech
Achieved 10.6% PFER on accented English
Quantified the effects of data scale, SSL representations, and loss objectives through controlled ablations
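The PFER figures above are easier to interpret with a concrete computation in hand. This summary does not define the metric, so the sketch below rests on assumptions: it treats PFER (commonly expanded as phonetic feature error rate) as an edit distance in which a substitution costs the fraction of articulatory features that differ, insertions and deletions cost 1, and the total is normalized by the number of reference phones. The four-feature table, the function names, and the normalization are all hypothetical illustrations, not the paper's evaluation code.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy feature table: (voiced, nasal, labial, continuant).
# A real evaluation would use a full articulatory feature inventory.
FEATURES: Dict[str, Tuple[int, ...]] = {
    "p": (0, 0, 1, 0),
    "b": (1, 0, 1, 0),
    "m": (1, 1, 1, 0),
    "t": (0, 0, 0, 0),
    "d": (1, 0, 0, 0),
    "n": (1, 1, 0, 0),
    "s": (0, 0, 0, 1),
    "z": (1, 0, 0, 1),
}


def substitution_cost(a: str, b: str) -> float:
    """Fraction of features that differ between two phones (0.0 to 1.0)."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)


def feature_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Levenshtein distance with feature-weighted substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev: List[float] = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr: List[float] = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # delete r from the reference
                curr[j - 1] + 1.0,                      # insert h from the hypothesis
                prev[j - 1] + substitution_cost(r, h),  # substitute (0 if identical)
            ))
        prev = curr
    return prev[-1]


def pfer(refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]) -> float:
    """Assumed normalization: total feature distance per reference phone, in percent."""
    total = sum(feature_edit_distance(r, h) for r, h in zip(refs, hyps))
    n_ref_phones = sum(len(r) for r in refs)
    return 100.0 * total / n_ref_phones
```

Under these assumptions, recognizing "b n s" as "p n s" differs in one of four features (voicing), so the distance is 0.25 rather than the full error a plain phone error rate would count. This graded penalty is the usual motivation for reporting a feature-based metric alongside, or instead of, phone error rate.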
Abstract
Phone recognition (PR) is a key enabler of multilingual and low-resource speech processing tasks, yet robust performance remains elusive. Highly performant English-focused models do not generalize across languages, while multilingual models underutilize pretrained representations. It also remains unclear how data scale, architecture, and training objective contribute to multilingual PR. We present PhoneticXEUS, a model trained on large-scale multilingual data that achieves state-of-the-art performance on both multilingual (17.7% PFER) and accented English speech (10.6% PFER). Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives. In addition, we analyze error patterns across language families, accented speech, and articulatory…
