CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing
Abdul Rehman, Jian-Jun Zhang, and Xiaosong Yang

TL;DR
CUPE is a lightweight, language-agnostic phoneme encoder that captures essential phonetic features in short segments, enabling effective cross-lingual speech processing with fewer parameters.
Contribution
The paper introduces CUPE, a model that processes short speech windows independently to learn universal phoneme features across languages, requiring less data and fewer computational resources.
Findings
Achieves competitive cross-lingual performance
Effective in zero-shot language transfer
Models fundamental acoustic patterns within phoneme-length windows
Abstract
Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our development of CUPE - a lightweight model that captures key phoneme features in just 120 milliseconds, about one phoneme's length. CUPE processes short, fixed-width windows independently and, despite fewer parameters than current approaches, achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages. Our extensive evaluation through supervised and self-supervised training on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus, demonstrates strong cross-lingual generalization and reveals that effective universal speech processing is possible through modeling basic acoustic patterns within…
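The abstract describes processing short, fixed-width windows independently. As a rough illustration only, the sketch below cuts a waveform into 120 ms windows and applies an encoder to each window with no shared context. The 16 kHz sample rate, the non-overlapping hop, the dropping of a trailing partial window, and the names `split_into_windows`, `encode_independently`, and `mean_energy` are all assumptions made for this example. None of them come from the paper, and the toy energy function merely stands in for the actual CUPE model.

```python
from typing import Callable, List, Sequence


def split_into_windows(
    samples: Sequence[float],
    sample_rate: int = 16000,  # assumed; the abstract does not state a sample rate
    window_ms: int = 120,      # window length given in the abstract
    hop_ms: int = 120,         # assumed: non-overlapping windows
) -> List[List[float]]:
    """Cut a waveform into fixed-width windows; a trailing partial window is dropped."""
    win = sample_rate * window_ms // 1000
    hop = sample_rate * hop_ms // 1000
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must be at least one sample long")
    return [
        list(samples[start:start + win])
        for start in range(0, len(samples) - win + 1, hop)
    ]


def encode_independently(
    windows: List[List[float]],
    encoder: Callable[[List[float]], float],
) -> List[float]:
    """Apply the encoder to each window on its own, with no cross-window context."""
    return [encoder(w) for w in windows]


def mean_energy(window: List[float]) -> float:
    """Hypothetical stand-in for a learned encoder: mean squared amplitude."""
    return sum(x * x for x in window) / len(window)


if __name__ == "__main__":
    one_second = [0.5] * 16000  # one second of constant-amplitude audio
    windows = split_into_windows(one_second)
    features = encode_independently(windows, mean_energy)
    print(len(windows), len(windows[0]), features[0])
```

Because each window is encoded on its own, the output for one window does not change when neighbouring windows change, which is the "contextless" property the title refers to.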
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Music and Audio Processing
