Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models
Alexander Polok, Santosh Kesiraju, Karel Beneš, Lukáš Burget, Jan Černocký

TL;DR
This paper introduces DeCRED, a decoder-centric regularisation method for encoder-decoder ASR models that enhances robustness and out-of-domain generalisation, leading to improved WERs with less data and smaller models.
Contribution
The paper proposes a novel regularisation approach, DeCRED, with auxiliary classifiers in the decoder, improving ASR performance and out-of-domain robustness over existing models.
Findings
DeCRED reduces WER by 2.7-2.9 on the AMI and Gigaspeech datasets.
DeCRED enhances out-of-domain generalisation.
Strong baseline models achieve competitive results with less data.
Abstract
This paper proposes a simple yet effective way of regularising encoder-decoder-based automatic speech recognition (ASR) models that enhances the robustness of the model and improves generalisation to out-of-domain scenarios. The proposed approach is dubbed Decoder-Centric Regularisation in Encoder-Decoder (DeCRED) architecture for ASR, where one or more auxiliary classifiers are introduced in layers of the decoder module. Leveraging these classifiers, we propose two decoding strategies that re-estimate the next token probabilities. Using the recent E-branchformer architecture, we build strong ASR systems that obtain competitive WERs compared to Whisper-medium and outperform OWSM v3, while relying on only a fraction of the training data and model size. On top of such a strong baseline, we show that DeCRED can further improve the…
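The idea of attaching auxiliary classifiers to decoder layers and using them to re-estimate next-token probabilities can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify the loss weighting or the two decoding strategies, so the weighted cross-entropy and the log-linear interpolation shown here, along with the names `auxiliary_loss`, `reestimate_next_token`, and `layer_weights`, are hypothetical and not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def auxiliary_loss(layer_logits, target, layer_weights):
    """Weighted sum of per-layer cross-entropies for one target token.

    layer_logits holds one logit vector per decoder layer that has a
    classifier (last entry = final layer). layer_weights are hypothetical
    hyperparameters, not values taken from the paper.
    """
    return sum(
        w * -log_softmax(logits)[target]
        for w, logits in zip(layer_weights, layer_logits)
    )


def reestimate_next_token(layer_logits, layer_weights):
    """Log-linear interpolation of per-layer distributions, renormalised.

    This is one plausible way to combine classifiers at decoding time,
    assumed here for illustration only.
    """
    mixed = [0.0] * len(layer_logits[0])
    for w, logits in zip(layer_weights, layer_logits):
        for i, log_prob in enumerate(log_softmax(logits)):
            mixed[i] += w * log_prob
    return log_softmax(mixed)


# Toy example with a 3-token vocabulary and two classifiers.
aux_logits = [0.2, 1.0, 0.1]     # intermediate-layer classifier
final_logits = [2.0, 1.9, -1.0]  # final-layer classifier
log_probs = reestimate_next_token([aux_logits, final_logits], [0.5, 0.5])
best_token = max(range(len(log_probs)), key=lambda i: log_probs[i])
```

In this toy case the final layer narrowly prefers token 0 while the auxiliary classifier prefers token 1, so the interpolated distribution can rank tokens differently from the final layer alone. Setting the auxiliary weight to zero recovers the ordinary final-layer prediction.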
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: E-Branchformer
