Multi-Modal Data Augmentation for End-to-End ASR
Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, Shinji Watanabe

TL;DR
This paper introduces a multi-modal data augmentation architecture for end-to-end ASR that integrates symbolic and acoustic inputs, improving recognition accuracy by leveraging large text datasets alongside speech data.
Contribution
The paper proposes a multi-modal data augmentation network (MMDA) with separate encoders for symbolic and acoustic input that share attention and decoder parameters, enabling large text corpora to be used when training ASR systems.
Findings
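Small improvements in character error rate (CER).
Up to 7-10% relative word error rate (WER) improvement over a baseline.
Large text corpora, converted to symbolic form, can be mixed with much smaller transcribed speech corpora during training.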
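A relative improvement is measured against the baseline error rate, not in absolute percentage points. The short sketch below shows that arithmetic. The function name and the example error rates are hypothetical and chosen only for illustration; they are not figures reported in the paper.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative error-rate reduction as a fraction of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, NOT taken from the paper: a baseline WER of 20.0%
# and an improved WER of 18.0% differ by 2.0 absolute points.
example = relative_reduction(20.0, 18.0)
print(f"relative WER reduction: {example:.1%}")
```

The same function applies to CER, since both metrics are error rates compared against a baseline.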
Abstract
We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using symbolic input in addition to the traditional acoustic input. This architecture utilizes two separate encoders: one for acoustic input and another for symbolic input, both sharing the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it can support multi-modal (acoustic and symbolic) input and enables seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements on character error rate (CER), and as much as 7-10% relative word error rate (WER) improvement over a baseline both with and without an external…
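The parameter-sharing layout described in the abstract can be illustrated with a toy sketch: two modality-specific encoders map their input into a common hidden space, and a single attention module and decoder serve both. This is not the authors' implementation. The class name `MMDASketch`, the hidden size, the linear and embedding encoders, the single-query attention, and the one-step decoder are all simplifying assumptions made to keep the example small and dependency-free.

```python
import math
import random

HIDDEN = 4  # hypothetical shared hidden size


def make_matrix(rows, cols, rng):
    """Random rows x cols weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(vectors, weights):
    """Multiply each input vector by a (in_dim x out_dim) weight matrix."""
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(vec, col)) for col in columns]
            for vec in vectors]


class MMDASketch:
    """Toy model: separate encoders, shared attention and decoder (assumed layout)."""

    def __init__(self, acoustic_dim, symbol_vocab, output_vocab, seed=0):
        rng = random.Random(seed)
        # Modality-specific parameters.
        self.acoustic_encoder = make_matrix(acoustic_dim, HIDDEN, rng)
        self.symbolic_encoder = make_matrix(symbol_vocab, HIDDEN, rng)  # embedding table
        # Parameters shared by both modalities.
        self.attention_query = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
        self.decoder = make_matrix(HIDDEN, output_vocab, rng)

    def encode(self, modality, inputs):
        if modality == "acoustic":   # inputs: list of feature frames
            return project(inputs, self.acoustic_encoder)
        if modality == "symbolic":   # inputs: list of symbol ids
            return [list(self.symbolic_encoder[i]) for i in inputs]
        raise ValueError(f"unknown modality: {modality}")

    def attend(self, states):
        """Softmax-weighted average of encoder states using one shared query."""
        scores = [sum(q * h for q, h in zip(self.attention_query, s)) for s in states]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [sum((e / total) * s[d] for e, s in zip(exps, states))
                for d in range(HIDDEN)]

    def forward(self, modality, inputs):
        """Return output-vocabulary scores for a single decoding step."""
        context = self.attend(self.encode(modality, inputs))
        return project([context], self.decoder)[0]


model = MMDASketch(acoustic_dim=3, symbol_vocab=5, output_vocab=6)
acoustic_scores = model.forward("acoustic", [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
symbolic_scores = model.forward("symbolic", [1, 4, 2])
```

Because `attend` and the decoder weights are the same objects for both calls, a training batch from either modality would update the same shared parameters, which is what allows text-derived symbolic data to supplement a smaller transcribed speech corpus.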
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing
