Transducer-based language embedding for spoken language identification
Peng Shen, Xugang Lu, Hisashi Kawai

TL;DR
This paper introduces a transducer-based language embedding method that combines acoustic and linguistic features for spoken language identification (LID), and reports significant performance gains on the multilingual LibriSpeech and VoxLingua107 datasets.
Contribution
The paper presents a novel RNN transducer-based language embedding approach that explicitly encodes linguistic features for enhanced LID performance.
Findings
12% to 59% relative improvement on in-domain datasets.
16% to 24% relative improvement on cross-domain datasets.
Effective integration of phonetically-aware acoustic features and explicit linguistic features.
Abstract
Acoustic and linguistic features are important cues for the spoken language identification (LID) task. Recent advanced LID systems mainly use acoustic features and lack explicit linguistic feature encoding. In this paper, we propose a novel transducer-based language embedding approach for LID tasks by integrating an RNN transducer model into a language embedding framework. Benefiting from the RNN transducer's linguistic representation capability, the proposed method can exploit both phonetically-aware acoustic features and explicit linguistic features for LID tasks. Experiments were carried out on the large-scale multilingual LibriSpeech and VoxLingua107 datasets. Experimental results showed that the proposed method significantly improves performance on LID tasks, with 12% to 59% and 16% to 24% relative improvement on in-domain and cross-domain…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
