Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR

Kun Wei; Yike Zhang; Sining Sun; Lei Xie; Long Ma

arXiv:2207.01039·eess.AS·July 5, 2022

TL;DR

This paper introduces a cross-modal learning approach that extracts contextual representations directly from preceding speech to improve conversational ASR, reducing character error rate (CER) by up to 16% on Mandarin datasets.

Contribution

It proposes a novel audio-textual cross-modal representation extractor that captures speech-text relationships without relying on recognition hypotheses, enhancing conversational ASR performance.

Findings

01

Achieved up to 16% CER reduction on the MagicData dataset.

02

Effectively captures cross-modal context dependencies.

03

Improves ASR accuracy by directly modeling speech and text relationships.
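
For readers unfamiliar with the metric in the findings above, here is a minimal sketch of how a character error rate (CER) and a reduction between two systems could be computed. The summary does not say whether the 16% figure is relative or absolute; this sketch assumes a relative reduction. The function names and the example numbers are illustrative assumptions, not values from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution (cost 0 if equal)
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate (assumed interpretation)."""
    return (baseline - improved) / baseline


# Illustrative numbers only: a CER drop from 25% to 21%
# corresponds to a relative reduction of about 16%.
baseline_cer = cer("abcd", "abxd")  # one substitution in four characters
print(baseline_cer, relative_reduction(0.25, 0.21))
```

CER is computed per character rather than per word, which suits Mandarin, where word boundaries are not marked in the text.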

Abstract

Leveraging context information is an intuitive idea to improve performance on conversational automatic speech recognition (ASR). Previous works usually adopt recognized hypotheses of historical utterances as preceding context, which may bias the current recognized hypothesis due to the inevitable historical recognition errors. To avoid this problem, we propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech. Specifically, it consists of two modal-related encoders, extracting high-level latent features from speech and the corresponding text, and a cross-modal encoder, which aims to learn the correlation between speech and text. We randomly mask some input tokens and input sequences of each modality. Then a token-missing or modal-missing prediction with a modal-level CTC loss on the cross-modal encoder is performed.…
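
The masking step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the mask symbol, the probabilities `token_p` and `modal_p`, the choice to make modality masking and token masking mutually exclusive, and the function names are all assumptions, since the abstract does not specify them. Speech frames are represented by placeholder strings; in practice they would be acoustic feature vectors.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def mask_tokens(seq, p, rng):
    """Token-missing case: replace each element with MASK with probability p."""
    return [MASK if rng.random() < p else x for x in seq]


def mask_inputs(speech, text, token_p=0.15, modal_p=0.1, rng=None):
    """Return (masked_speech, masked_text, dropped_modality).

    With probability modal_p the whole speech sequence is masked, with the
    same probability the whole text sequence is masked (modal-missing case);
    otherwise tokens of both modalities are masked independently.
    All rates here are assumed values, not taken from the paper.
    """
    rng = rng or random.Random()
    r = rng.random()
    if r < modal_p:
        return [MASK] * len(speech), list(text), "speech"
    if r < 2 * modal_p:
        return list(speech), [MASK] * len(text), "text"
    return mask_tokens(speech, token_p, rng), mask_tokens(text, token_p, rng), None


rng = random.Random(0)
frames = ["f0", "f1", "f2", "f3"]  # stand-ins for speech frames
chars = ["你", "好", "吗"]
print(mask_inputs(frames, chars, rng=rng))
```

A model trained on such inputs would then be asked to predict the masked tokens or the missing modality; the prediction heads and the modal-level CTC loss named in the abstract are outside the scope of this sketch.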

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing

Methods: Connectionist Temporal Classification Loss