Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition
Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, Hermann Ney

TL;DR
This paper introduces acoustic data-driven subword modeling (ADSM), an approach that improves end-to-end speech recognition by producing acoustically more logical and more balanced subword units. It outperforms byte pair encoding (BPE) and pronunciation-assisted subword modeling (PASM).
Contribution
The paper presents a fully acoustic-oriented subword modeling method that integrates advantages of text-based and acoustic-based approaches for better ASR performance.
Findings
ADSM outperforms BPE and PASM on LibriSpeech.
ADSM produces acoustically more logical and more balanced subword units than BPE and PASM.
ADSM labels work with several end-to-end ASR models: CTC, RNN-Transducer, and attention models.
Abstract
Subword units are commonly used for end-to-end automatic speech recognition (ASR), while a fully acoustic-oriented subword modeling approach is somewhat missing. We propose an acoustic data-driven subword modeling (ADSM) approach that adapts the advantages of several text-based and acoustic-based subword methods into one pipeline. With a fully acoustic-oriented label design and learning process, ADSM produces acoustic-structured subword units and acoustic-matched target sequence for further ASR training. The obtained ADSM labels are evaluated with different end-to-end ASR approaches including CTC, RNN-Transducer and attention models. Experiments on the LibriSpeech corpus show that ADSM clearly outperforms both byte pair encoding (BPE) and pronunciation-assisted subword modeling (PASM) in all cases. Detailed analysis shows that ADSM achieves acoustically more logical word segmentation…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
