Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech
Shashi Kant Gupta, Sushant Hiray, Prashant Kukde

TL;DR
This paper presents a two-stage Encoder-Decoder E2E model for robust language identification in challenging speech conditions, achieving significant error rate improvements on Singaporean child-directed code-switched speech.
Contribution
Introduces a lightweight Encoder-Decoder model tailored for non-standard, accented, and spontaneous code-switched speech, with curated datasets for public use.
Findings
Achieved 15.6% EER in closed track
Achieved 11.1% EER in open track
Curated and released additional Singaporean speech data
Abstract
This work focuses on improving the Spoken Language Identification (LangId) system for a challenge that focuses on developing robust language identification systems that are reliable for non-standard, accented (Singaporean accent), spontaneous code-switched, and child-directed speech collected via Zoom. We propose a two-stage Encoder-Decoder-based E2E model. The encoder module consists of 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with a global context. The decoder module uses an attentive temporal pooling mechanism to get fixed length time-independent feature representation. The total number of parameters in the model is around 22.1 M, which is relatively light compared to using some large-scale pre-trained speech models. We achieved an EER of 15.6% in the closed track and 11.1% in the open track (baseline system 22.1%). We also curated additional…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
