Toward Cross-Domain Speech Recognition with End-to-End Models
Thai-Son Nguyen, Sebastian Stüker, Alex Waibel

TL;DR
This paper empirically compares hybrid acoustic models with neural end-to-end speech recognition systems trained on mixed-domain data. It reports that the end-to-end models generalize better across domains, matching or exceeding domain-specific hybrid models without domain-specific adaptation.
Contribution
The study provides empirical evidence that end-to-end models generalize better across multiple domains than hybrid models, simplifying multi-domain speech recognition.
Findings
End-to-end models outperform hybrid models on diverse domains.
Multi-domain end-to-end models match domain-specific hybrid model performance.
End-to-end models eliminate the need for domain-adapted language models.
Abstract
In the area of multi-domain speech recognition, research in the past focused on hybrid acoustic models to build cross-domain and domain-invariant speech recognition systems. In this paper, we empirically examine the difference in behavior between hybrid acoustic models and neural end-to-end systems when mixing acoustic training data from several domains. For these experiments we composed a multi-domain dataset from public sources, with the different domains in the corpus covering a wide variety of topics and acoustic conditions such as telephone conversations, lectures, read speech and broadcast news. We show that for the hybrid models, supplying additional training data from other domains with mismatched acoustic conditions does not increase the performance on specific domains. However, our end-to-end models optimized with sequence-based criterion generalize better than the hybrid…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
