Text-Only Domain Adaptation for End-to-End Speech Recognition through Down-Sampling Acoustic Representation

Jiaxu Zhu; Weinan Tong; Yaoxun Xu; Changhe Song; Zhiyong Wu; Zhao You; Dan Su; Dong Yu; Helen Meng

arXiv:2309.02459·cs.SD·October 10, 2023

TL;DR

This paper introduces a novel down-sampling strategy using a CIF module to align acoustic and text representations, enabling improved domain adaptation for end-to-end speech recognition with text-only data.

Contribution

It proposes a new representation matching method that down-samples acoustic features to better align with text, enhancing domain adaptation in speech recognition.

Findings

01. Effective domain adaptation demonstrated on new-domain data
02. Improved alignment of acoustic and text representations
03. Enhanced performance with text-only data in ASR

Abstract

Mapping two modalities, speech and text, into a shared representation space is a research direction for using text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, speech and text representations differ in length. Although a previous method up-samples the text representation to align it with the acoustic modality, the up-sampled length may not match the actual duration. In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align it with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, allowing for domain adaptation using text-only data from the target domain. Experiment results of new domain data…
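
The down-sampling idea in the abstract can be illustrated with a minimal sketch of the general integrate-and-fire mechanism: per-frame weights are accumulated, and each time the running sum reaches a threshold, one token-level vector is emitted as the weighted sum of the contributing frames. This is not the paper's implementation. The function name `cif_downsample`, the threshold of 1.0, and the assumption that every weight lies in [0, 1] (so a frame fires at most once) are illustrative choices, and in a real model the weights would be predicted by a trained network rather than supplied by hand.

```python
def cif_downsample(frames, alphas, threshold=1.0):
    """Hypothetical CIF-style down-sampling sketch (not the authors' code).

    frames: list of equal-length feature vectors, one per acoustic frame.
    alphas: one weight per frame, assumed to lie in [0, 1].
    Returns one integrated vector per firing event.
    """
    dim = len(frames[0])
    tokens = []
    acc_weight = 0.0          # weight accumulated since the last firing
    acc_vec = [0.0] * dim     # weighted sum of frames since the last firing

    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # Not enough weight yet: keep integrating this frame.
            acc_weight += alpha
            acc_vec = [a + alpha * f for a, f in zip(acc_vec, frame)]
        else:
            # Fire: use only the part of alpha needed to reach the threshold.
            used = threshold - acc_weight
            tokens.append([a + used * f for a, f in zip(acc_vec, frame)])
            # The leftover weight starts the next token's accumulation.
            rest = alpha - used
            acc_weight = rest
            acc_vec = [rest * f for f in frame]
    return tokens


# Four frames with weight 0.5 each integrate into two token-level vectors.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
tokens = cif_downsample(frames, [0.5, 0.5, 0.5, 0.5])
print(len(tokens))  # 2
```

In this sketch the number of output vectors is roughly the sum of the weights, which is how a sequence of acoustic frames can be shortened toward the length of the token sequence. Any weight left over after the last firing is dropped here, which is one of several possible ways to handle the tail.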

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: ALIGN