ZIPA: A family of efficient models for multilingual phone recognition
Jian Zhu, Farhan Samir, Eleanor Chodroff, David R. Mortensen

TL;DR
ZIPA introduces a family of efficient multilingual speech models that significantly improve crosslinguistic phone recognition performance using large-scale data and novel architectures, while highlighting ongoing challenges in sociophonetic diversity modeling.
Contribution
The paper presents ZIPA, a new family of efficient multilingual speech models with large-scale training data and novel architectures, advancing crosslinguistic phone recognition.
Findings
ZIPA models outperform existing systems with fewer parameters.
Scaling with noisy student training improves performance further.
Persistent challenges remain in modeling sociophonetic diversity.
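The noisy student step mentioned in the findings can be sketched in a few lines. This is a toy illustration of the general technique, not ZIPA's training pipeline: a nearest-centroid classifier over single numbers stands in for the phone recognizer, and every name here (`fit_centroids`, `predict`, `noisy_student_round`, `min_margin`, `noise_std`) is hypothetical. The confidence filter on pseudo-labels is also an assumption; the source only states that pseudo-labeled multilingual data was used.

```python
import random
from statistics import mean


def fit_centroids(examples):
    """Train a toy nearest-centroid classifier: one mean feature value per label."""
    by_label = {}
    for x, y in examples:
        by_label.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_label.items()}


def predict(centroids, x):
    """Return (label, margin), where margin is how much closer the best
    centroid is than the runner-up (a crude confidence score)."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, float("inf")
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[best])
    return best, margin


def noisy_student_round(labeled, unlabeled, min_margin=0.5, noise_std=0.1, seed=0):
    """One teacher -> student round (hypothetical helper, not from the paper)."""
    rng = random.Random(seed)

    # 1. Train the teacher on human-labeled data only.
    teacher = fit_centroids(labeled)

    # 2. Pseudo-label the unlabeled pool, keeping only confident predictions
    #    (the filtering threshold is an assumption of this sketch).
    pseudo = []
    for x in unlabeled:
        label, margin = predict(teacher, x)
        if margin >= min_margin:
            pseudo.append((x, label))

    # 3. Train the student on labeled + pseudo-labeled data with input noise.
    noisy = [(x + rng.gauss(0.0, noise_std), y) for x, y in labeled + pseudo]
    student = fit_centroids(noisy)
    return student, pseudo


labeled = [(0.0, "a"), (0.2, "a"), (2.0, "i"), (2.2, "i")]
unlabeled = [0.1, 1.1, 2.1, 1.9]
student, pseudo = noisy_student_round(labeled, unlabeled)
```

In this toy run the ambiguous point `1.1`, which sits midway between the two classes, is dropped by the margin filter, and the student is fit on the remaining seven examples. In a real system the noise would typically come from data augmentation and dropout rather than Gaussian jitter on the inputs.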
Abstract
We present ZIPA, a family of efficient speech models that advances the state of the art in crosslinguistic phone recognition. We first curated IPAPack++, a large-scale multilingual speech corpus with 17,132 hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. Trained on this large-scale data, ZIPA, which includes transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverages the efficient Zipformer backbone and outperforms existing phone recognition systems with far fewer parameters. Scaling further via noisy student training on 11,000 hours of pseudo-labeled multilingual data yields additional improvement. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · Authorship Attribution and Profiling
Methods: Stochastic Depth · Dropout · RandAugment · Noisy Student · Sparse Evolutionary Training
