Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition

Wei Zhou; Mohammad Zeineldeen; Zuoyun Zheng; Ralf Schlüter; Hermann Ney

arXiv:2104.09106·cs.CL·October 24, 2023

TL;DR

This paper introduces an acoustic data-driven subword modeling (ADSM) approach for end-to-end speech recognition. ADSM produces acoustically more logical and more balanced subword units, and it outperforms byte pair encoding (BPE) and pronunciation-assisted subword modeling (PASM) on LibriSpeech.

Contribution

The paper presents a fully acoustic-oriented subword modeling method that integrates advantages of text-based and acoustic-based approaches for better ASR performance.

Findings

01 ADSM outperforms BPE and PASM on LibriSpeech.

02 ADSM produces more logical and balanced subword units.

03 ADSM labels are applicable to different end-to-end ASR models, including CTC, RNN-Transducer and attention models.

Abstract

Subword units are commonly used for end-to-end automatic speech recognition (ASR), while a fully acoustic-oriented subword modeling approach is somewhat missing. We propose an acoustic data-driven subword modeling (ADSM) approach that adapts the advantages of several text-based and acoustic-based subword methods into one pipeline. With a fully acoustic-oriented label design and learning process, ADSM produces acoustic-structured subword units and acoustic-matched target sequence for further ASR training. The obtained ADSM labels are evaluated with different end-to-end ASR approaches including CTC, RNN-Transducer and attention models. Experiments on the LibriSpeech corpus show that ADSM clearly outperforms both byte pair encoding (BPE) and pronunciation-assisted subword modeling (PASM) in all cases. Detailed analysis shows that ADSM achieves acoustically more logical word segmentation…
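
For context on the text-based baseline named in the abstract, the following is a minimal sketch of how BPE merges are commonly learned from word counts. It is not the ADSM pipeline and not the paper's BPE configuration; the function name `learn_bpe`, its parameters, and the toy input are assumptions made for illustration only.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: count} dict.

    Illustrative helper (hypothetical name); returns the ordered merge
    list and the final segmentation of each word.
    """
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties: first one encountered).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite each word, joining non-overlapping occurrences of `best`.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


if __name__ == "__main__":
    merges, segmented = learn_bpe({"lower": 2, "lowest": 3}, num_merges=4)
    print(merges)
    print(segmented)
```

Because the merges depend only on character co-occurrence counts in text, nothing in this procedure ties a subword boundary to the acoustics, which is the gap the abstract says ADSM is meant to address.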

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques