PRiSM: Benchmarking Phone Realization in Speech Models

Shikhar Bharadwaj; Chin-Jou Li; Yoonjae Kim; Kwanghee Choi; Eunjung Yeo; Ryan Soh-Eun Shim; Hanyu Zhou; Brendon Boldt; Karen Rosero Jacome; Kalvin Chang; Darsh Agrawal; Keer Xu; Chao-Han Huck Yang; Jian Zhu; Shinji Watanabe; David R. Mortensen

arXiv:2601.14046·cs.CL·January 21, 2026

TL;DR

PRiSM is an open-source benchmark for phonetic perception in speech models. Its evaluations show that diverse language exposure during training drives phone recognition performance and that encoder-CTC models are the most stable.

Contribution

The paper introduces PRiSM, the first open-source benchmark for phonetic perception. It combines intrinsic and extrinsic evaluations to expose blind spots in current phone recognition (PR) systems.

Findings

01. Diverse language exposure during training improves phone recognition performance.

02. Encoder-CTC models are the most stable.

03. Specialized phone recognition models outperform Large Audio Language Models.

Abstract

Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR models still outperform Large Audio Language Models. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models…
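
The "surface-level transcription accuracy" mentioned in the abstract is commonly reported as a phone error rate (PER): the edit distance between reference and predicted phone sequences, divided by the number of reference phones. The abstract does not say which metric PRiSM uses, so the following is only a minimal sketch of that common measure. The function names `edit_distance` and `phone_error_rate` and the toy phone sequences are illustrative assumptions, not part of the PRiSM code release.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference phone
                            curr[j - 1] + 1,      # insertion of a hypothesis phone
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(refs: Sequence[Sequence[str]],
                     hyps: Sequence[Sequence[str]]) -> float:
    """Corpus-level PER: total edits divided by total reference phones."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference contains no phones")
    return total_edits / total_ref


# Hypothetical toy data: one substitution and one deletion over six phones.
refs = [["k", "æ", "t"], ["d", "ɔ", "g"]]
hyps = [["k", "a", "t"], ["d", "ɔ"]]
per = phone_error_rate(refs, hyps)
```

A score like this treats every phone mismatch equally, which is one reason a benchmark might pair it with downstream (extrinsic) probes, as the abstract describes.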

Taxonomy

Topics: Speech Recognition and Synthesis · Face recognition and analysis · Domain Adaptation and Few-Shot Learning