PHO-LID: A Unified Model Incorporating Acoustic-Phonetic and Phonotactic Information for Language Identification

Hexin Liu; Leibny Paola Garcia Perera; Andy W. H. Khong; Suzy J. Styles; Sanjeev Khudanpur

arXiv:2203.12366·eess.AS·April 1, 2022·Interspeech

TL;DR

This paper introduces PHO-LID, a hierarchical model that combines acoustic-phonetic and phonotactic information for language identification (LID). It needs no phoneme annotations for training and reports over 40% relative improvement in average cost on AP17-OLR data.

Contribution

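The CNN-Trans architecture shares a CNN module between a self-supervised phoneme segmentation task and the LID task, then feeds the resulting phonotactic embeddings into transformer encoder layers for utterance-level LID. With multi-task optimization, PHO-LID achieves the highest LID performance among the models evaluated, without requiring phoneme labels.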

Findings

01

Over 40% relative improvement in average cost on AP17-OLR data.

02

Higher accuracy on languages within the same cluster in NIST LRE 2017.

03

Effective leveraging of phoneme information demonstrated through boundary and spectrogram analysis.
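
The "relative improvement" in the first finding can be read as the fractional reduction of a cost metric, where lower is better. The sketch below shows that arithmetic only. The function name and the cost values are hypothetical placeholders chosen for illustration and are not results from the paper.

```python
def relative_improvement(baseline_cost: float, new_cost: float) -> float:
    """Fractional reduction in a lower-is-better cost metric.

    Returns (baseline - new) / baseline, so 0.4 means a 40% relative
    improvement over the baseline.
    """
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - new_cost) / baseline_cost


# Hypothetical costs, NOT taken from the paper: a baseline average cost
# of 0.10 and a new average cost of 0.06.
improvement = relative_improvement(0.10, 0.06)
print(f"relative improvement: {improvement:.0%}")
```

With these placeholder values the reduction is about 40%, which is the threshold the finding refers to.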

Abstract

We propose a novel model to hierarchically incorporate phoneme and phonotactic information for language identification (LID) without requiring phoneme annotations for training. In this model, named PHO-LID, a self-supervised phoneme segmentation task and a LID task share a convolutional neural network (CNN) module, which encodes both language identity and sequential phonemic information in the input speech to generate an intermediate sequence of phonotactic embeddings. These embeddings are then fed into transformer encoder layers for utterance-level LID. We call this architecture CNN-Trans. We evaluate it on AP17-OLR data and the MLS14 set of NIST LRE 2017, and show that the PHO-LID model with multi-task optimization exhibits the highest LID performance among all models, achieving over 40% relative improvement in terms of average cost on AP17-OLR data compared to a CNN-Trans model…
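
The data flow described in the abstract (a shared CNN module producing a sequence of embeddings, followed by transformer encoder layers for utterance-level LID) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, layer sizes, mean pooling, and the scalar boundary head standing in for the self-supervised segmentation task are all hypothetical. See the official repository for the actual model.

```python
import torch
import torch.nn as nn


class CNNTransSketch(nn.Module):
    """Illustrative CNN + transformer encoder for utterance-level LID.

    All hyperparameters here are assumptions made for the sketch.
    """

    def __init__(self, feat_dim=80, embed_dim=64, num_langs=10,
                 num_layers=2, num_heads=4):
        super().__init__()
        # Shared CNN: maps frame-level features to a shorter sequence
        # of embeddings (the second convolution halves the time axis).
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(embed_dim, embed_dim, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
        )
        # Hypothetical auxiliary head: one boundary score per step,
        # a stand-in for the self-supervised segmentation objective.
        self.boundary_head = nn.Linear(embed_dim, 1)
        # Transformer encoder over the embedding sequence.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=128, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Utterance-level language classifier.
        self.utt_head = nn.Linear(embed_dim, num_langs)

    def forward(self, feats):
        # feats: (batch, time, feat_dim); Conv1d expects (batch, channels, time).
        x = self.cnn(feats.transpose(1, 2)).transpose(1, 2)
        boundary_scores = self.boundary_head(x).squeeze(-1)  # (batch, steps)
        h = self.encoder(x)                                   # (batch, steps, embed_dim)
        utt_logits = self.utt_head(h.mean(dim=1))             # (batch, num_langs)
        return utt_logits, boundary_scores


# Usage with random input: 2 utterances, 100 frames, 80-dim features.
model = CNNTransSketch().eval()
with torch.no_grad():
    utt_logits, boundary_scores = model(torch.randn(2, 100, 80))
print(utt_logits.shape, boundary_scores.shape)
```

In a multi-task setup, a loss on `utt_logits` and a loss on the auxiliary output would both backpropagate into the shared CNN. How the paper defines and weights those losses is not stated in the abstract, so the sketch leaves them out.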

Code & Models

Repositories

lhx94as/pho-lid (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques