PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification
Hexin Liu, Leibny Paola Garcia Perera, Andy W. H. Khong, Suzy J. Styles, Sanjeev Khudanpur

TL;DR
This paper introduces PHO-LID, a hierarchical model that combines acoustic-phonetic and phonotactic information for language identification (LID). It achieves over 40% relative improvement in average cost on AP17-OLR data without needing phoneme annotations during training.
Contribution
The CNN-Trans architecture shares a CNN module between a self-supervised phoneme segmentation task and the LID task, then feeds the resulting phonotactic embeddings into transformer encoder layers for utterance-level LID. No phoneme labels are required for training.
Findings
Over 40% relative improvement in average cost on AP17-OLR data.
Higher accuracy on languages within the same cluster in NIST LRE 2017.
Effective leveraging of phoneme information demonstrated through boundary and spectrogram analysis.
Abstract
We propose a novel model to hierarchically incorporate phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and a LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of phonotactic embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multi-task optimization exhibits the highest LID performance among all models, achieving over 40% relative improvement in terms of average cost on AP17-OLR data compared to a CNN-Trans model…
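The data flow described in the abstract can be sketched in PyTorch as a shared CNN feeding two branches. This is a minimal illustration, not the authors' implementation: the class name, layer sizes, mean pooling, and the linear segmentation head are all assumptions, and the self-supervised segmentation objective itself is omitted.

```python
import torch
import torch.nn as nn


class CNNTransSketch(nn.Module):
    """Illustrative only: shared CNN -> (segmentation branch, transformer LID branch)."""

    def __init__(self, feat_dim=40, emb_dim=64, num_langs=10, num_layers=2, num_heads=4):
        super().__init__()
        # Shared CNN over the time axis; kernel sizes and widths are assumed.
        self.shared_cnn = nn.Sequential(
            nn.Conv1d(feat_dim, emb_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Placeholder head standing in for the self-supervised segmentation task.
        self.seg_head = nn.Linear(emb_dim, emb_dim)
        # Transformer encoder layers for utterance-level LID.
        layer = nn.TransformerEncoderLayer(
            d_model=emb_dim, nhead=num_heads, dim_feedforward=128, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(emb_dim, num_langs)

    def forward(self, feats):
        # feats: (batch, time, feat_dim); Conv1d expects (batch, channels, time).
        emb = self.shared_cnn(feats.transpose(1, 2)).transpose(1, 2)
        seg_out = self.seg_head(emb)              # per-frame output for the segmentation branch
        pooled = self.transformer(emb).mean(dim=1)  # assumed mean pooling over time
        logits = self.classifier(pooled)          # one score per language
        return logits, seg_out


model = CNNTransSketch().eval()
with torch.no_grad():
    logits, seg_out = model(torch.randn(2, 50, 40))  # 2 utterances, 50 frames, 40-dim features
print(logits.shape, seg_out.shape)
```

In a multi-task setup, both outputs would contribute to the training loss so that gradients from the two tasks update the shared CNN. How the paper weights or schedules the two objectives is not stated in the text above.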
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
