Reducing Language confusion for Code-switching Speech Recognition with Token-level Language Diarization
Hexin Liu, Haihua Xu, Leibny Paola Garcia, Andy W. H. Khong, Yi He, Sanjeev Khudanpur

TL;DR
This paper proposes methods to reduce language confusion in code-switching speech recognition by incorporating token-level language information and disentangling language features, leading to improved recognition accuracy.
Contribution
It introduces a novel approach combining language posterior bias and adversarial disentangling to enhance code-switching speech recognition performance.
Findings
Language posterior bias outperforms disentangling in reducing confusion.
Joint optimization with language diarization improves recognition accuracy.
Incorporating language information is more effective than disentangling.
Abstract
Code-switching (CS) refers to the phenomenon in which languages switch within a speech signal, which leads to language confusion for automatic speech recognition (ASR). This paper aims to address language confusion for improving CS-ASR from two perspectives: incorporating and disentangling language information. We incorporate language information in the CS-ASR model by dynamically biasing the model with token-level language posteriors, which are the outputs of a sequence-to-sequence auxiliary language diarization (LD) module. In contrast, the disentangling process reduces the difference between languages via adversarial training so as to normalize the two languages. We conduct experiments on the SEAME dataset. Compared to the baseline model, both the joint optimization with LD and the language posterior bias achieve performance improvement. The comparison of the proposed methods indicates that…
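The idea of biasing a recognizer with token-level language posteriors can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the exact form of the bias, so the additive log-posterior rule, the function name `bias_with_language_posterior`, the `weight` parameter, the four-word vocabulary, and all numeric values below are assumptions made purely for illustration.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def bias_with_language_posterior(asr_logits, lang_posterior, token_lang, weight=1.0):
    """Hypothetical bias rule (an assumption, not taken from the paper):
    add weight * log p(language) to each token's log-probability,
    then renormalize so the result is again a distribution."""
    log_probs = log_softmax(asr_logits)
    biased = [
        lp + weight * math.log(max(lang_posterior[token_lang[i]], 1e-8))
        for i, lp in enumerate(log_probs)
    ]
    return log_softmax(biased)

# Toy Mandarin-English vocabulary (romanized, hypothetical).
vocab = ["nihao", "hello", "xiexie", "thanks"]
token_lang = ["zh", "en", "zh", "en"]

# ASR scores for one output position: two near-tied candidates.
asr_logits = [2.0, 1.9, 0.5, 0.4]

# Language posterior for this position, as an LD module might produce it.
lang_posterior = {"zh": 0.1, "en": 0.9}

unbiased = log_softmax(asr_logits)
biased = bias_with_language_posterior(asr_logits, lang_posterior, token_lang)

best_unbiased = vocab[max(range(len(vocab)), key=lambda i: unbiased[i])]
best_biased = vocab[max(range(len(vocab)), key=lambda i: biased[i])]
print(best_unbiased, "->", best_biased)
```

In this toy case the two top candidates are almost tied acoustically, and the language posterior tips the decision toward the English token. In a real system the posteriors would come from a trained LD module and the bias would be applied inside the model rather than as a post-hoc rescoring step.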
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
