TL;DR
This paper addresses the challenges of language modeling for code-switched speech by proposing an ASR-motivated evaluation setup, demonstrating the effectiveness of discriminative training, and exploring training protocols involving monolingual and code-switched data.
Contribution
It introduces a new evaluation setup for code-switching language models, argues for discriminative training over generative language modeling, and shows the benefit of combining monolingual and code-switched data in training.
Findings
Discriminatively trained models outperform generative language models in the proposed code-switching evaluation setup.
The proposed evaluation setup isolates language modeling performance from ASR system complexities.
Training on large amounts of monolingual data and then fine-tuning on code-switched data improves code-switching language modeling performance.
Abstract
We focus on the problem of language modeling for code-switched language, in the context of automatic speech recognition (ASR). Language modeling for code-switched language is challenging for (at least) three reasons: (1) lack of available large-scale code-switched data for training; (2) lack of a replicable evaluation setup that is ASR directed yet isolates language modeling performance from the other intricacies of the ASR system; and (3) the reliance on generative modeling. We tackle these three issues: we propose an ASR-motivated evaluation setup which is decoupled from an ASR system and the choice of vocabulary, and provide an evaluation dataset for English-Spanish code-switching. This setup lends itself to a discriminative training approach, which we demonstrate to work better than generative language modeling. Finally, we explore a variety of training protocols and verify the…
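The abstract does not spell out the mechanics of the evaluation setup, so the following is only a minimal sketch under stated assumptions. It assumes each evaluation item is a set of candidate sentences in which the first is the gold sentence and the rest are plausible alternatives, and that a model is judged by how often it scores the gold sentence highest. The names `ranking_accuracy`, `margin_loss`, and `toy_score`, the toy lexicon, and the example sentences are all hypothetical; the hinge loss is a generic discriminative objective used for illustration, not necessarily the one used in the paper.

```python
from typing import Callable, Sequence

# A scorer maps a sentence to a real-valued score (higher = more plausible).
Scorer = Callable[[str], float]


def ranking_accuracy(sets: Sequence[Sequence[str]], score: Scorer) -> float:
    """Fraction of candidate sets in which the gold sentence (index 0)
    scores strictly higher than every alternative.

    Assumption: the gold sentence is always stored first in each set.
    """
    if not sets:
        raise ValueError("need at least one candidate set")
    correct = 0
    for candidates in sets:
        gold, alternatives = candidates[0], candidates[1:]
        gold_score = score(gold)
        if all(gold_score > score(alt) for alt in alternatives):
            correct += 1
    return correct / len(sets)


def margin_loss(gold_score: float, alt_scores: Sequence[float],
                margin: float = 1.0) -> float:
    """Generic hinge loss: zero once the gold sentence beats the
    best-scoring alternative by at least `margin`."""
    return max(0.0, margin - gold_score + max(alt_scores))


# Hypothetical stand-in for a trained model: the fraction of words
# found in a tiny mixed English-Spanish lexicon.
TOY_LEXICON = {"i", "want", "to", "go", "a", "la", "playa"}


def toy_score(sentence: str) -> float:
    words = sentence.lower().split()
    return sum(w in TOY_LEXICON for w in words) / len(words)


# Toy data (invented for illustration): gold sentence first.
toy_sets = [
    ["i want to go a la playa", "i want to go a la plaza",
     "i won to go a la playa"],
    ["we go to la playa", "i go to la playa"],
]

print(ranking_accuracy(toy_sets, toy_score))  # 0.5
```

Because the metric only compares scores within a candidate set, it needs no normalized probabilities and no fixed vocabulary, which is one way an evaluation can be decoupled from the ASR system. The same property is what makes a margin-style discriminative objective a natural fit for training.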
