Analysis of Domain Shift across ASR Architectures via TTS-Enabled Separation of Target Domain and Acoustic Conditions
Tina Raissi, Nick Rossenbach, Ralf Schlüter

TL;DR
This paper investigates how different ASR architectures and modeling choices perform under domain mismatch, using TTS-generated target domain data to isolate language effects and assess generalization.
Contribution
To the authors' knowledge, it provides the first controlled comparison of optimized ASR systems across state-of-the-art architectures under domain shift, highlighting how specific modeling choices (label units, context length, and topology) affect performance.
Findings
Modeling choices significantly influence ASR performance under domain shift.
Seq2seq and modular architectures show similar robustness when optimized.
Target domain adaptation improves recognition without retraining acoustic models.
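One common way to use a target domain language model without touching the acoustic model is to rescore an n-best list with a log-linear combination of acoustic and language model scores. The sketch below illustrates that idea only; the summary does not say how the paper integrates its language models, so the function names, the `lm_scale` value, and the toy word list are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# (hypothesis text, acoustic-model log-score); hypothetical type for this sketch
Hypothesis = Tuple[str, float]


def rescore_nbest(
    nbest: List[Hypothesis],
    lm_logprob: Callable[[str], float],
    lm_scale: float = 0.5,  # assumed value; normally tuned on held-out data
) -> List[Hypothesis]:
    """Sort hypotheses by am_score + lm_scale * lm_logprob(text), best first."""
    rescored = [(text, am + lm_scale * lm_logprob(text)) for text, am in nbest]
    return sorted(rescored, key=lambda h: h[1], reverse=True)


def toy_target_lm(text: str) -> float:
    """Hypothetical stand-in for a target domain LM: in-domain words are cheap."""
    in_domain = {"patient", "dosage"}  # arbitrary toy vocabulary
    return sum(-1.0 if w in in_domain else -4.0 for w in text.split())


# The acoustic scores alone prefer the second hypothesis (-9.5 > -10.0).
nbest = [("the patient dosage", -10.0), ("the patients dose age", -9.5)]
best_text, best_score = rescore_nbest(nbest, toy_target_lm)[0]
print(best_text, best_score)
```

In this toy case the acoustic scores are unchanged, yet the ranking flips toward the in-domain hypothesis once the target domain LM score is added. This is the sense in which adaptation can happen without retraining the acoustic model.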
Abstract
We analyze automatic speech recognition (ASR) modeling choices under domain mismatch, comparing classic modular and novel sequence-to-sequence (seq2seq) architectures. Across the different ASR architectures, we examine a spectrum of modeling choices, including label units, context length, and topology. To isolate language domain effects from acoustic variation, we synthesize target domain audio using a text-to-speech system trained on LibriSpeech. We incorporate target domain n-gram and neural language models for domain adaptation without retraining the acoustic model. To our knowledge, this is the first controlled comparison of optimized ASR systems across state-of-the-art architectures under domain shift, offering insights into their generalization. The results show that, under domain shift, rather than the decoder architecture choice or the distinction between classic modular and…
