Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR
Kun Wei, Yike Zhang, Sining Sun, Lei Xie, Long Ma

TL;DR
This paper introduces a cross-modal learning approach that directly extracts contextual representations from speech to improve conversational ASR, reducing errors by up to 16% on Mandarin datasets.
Contribution
It proposes a novel audio-textual cross-modal representation extractor that captures speech-text relationships without relying on recognition hypotheses, enhancing conversational ASR performance.
Findings
Achieves up to 16% character error rate (CER) reduction on the MagicData dataset.
Effectively captures cross-modal context dependencies.
Improves ASR accuracy by directly modeling speech and text relationships.
Abstract
Leveraging context information is an intuitive idea to improve performance on conversational automatic speech recognition (ASR). Previous works usually adopt recognized hypotheses of historical utterances as preceding context, which may bias the current recognized hypothesis due to the inevitable historical recognition errors. To avoid this problem, we propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech. Specifically, it consists of two modal-related encoders, extracting high-level latent features from speech and the corresponding text, and a cross-modal encoder, which aims to learn the correlation between speech and text. We randomly mask some input tokens and input sequences of each modality. Then a token-missing or modal-missing prediction with a modal-level CTC loss on the cross-modal encoder is performed.…
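The two masking steps named in the abstract (masking individual input tokens, and masking a whole modality's input sequence) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `<mask>` placeholder, the masking probabilities, and the uniform choice of which modality to drop are all assumptions made for illustration, and speech frames are stood in for by plain labels rather than real acoustic features.

```python
import random

MASK = "<mask>"  # hypothetical placeholder symbol, not taken from the paper


def mask_tokens(tokens, mask_prob, rng):
    """Token-level masking: replace each item with MASK independently
    with probability mask_prob. The sequence length is preserved."""
    return [MASK if rng.random() < mask_prob else t for t in tokens]


def mask_modality(speech, text, drop_prob, rng):
    """Modality-level masking: with probability drop_prob, blank out one
    entire modality. Which modality is dropped is chosen uniformly here,
    which is an assumption of this sketch."""
    if rng.random() < drop_prob:
        if rng.random() < 0.5:
            speech = [MASK] * len(speech)
        else:
            text = [MASK] * len(text)
    return speech, text


if __name__ == "__main__":
    rng = random.Random(0)
    speech_frames = ["f1", "f2", "f3", "f4", "f5", "f6"]  # stand-ins for acoustic frames
    text_tokens = ["ni", "hao", "ma"]

    # Mask some tokens in each modality, then possibly drop one modality.
    speech_in = mask_tokens(speech_frames, 0.15, rng)
    text_in = mask_tokens(text_tokens, 0.15, rng)
    speech_in, text_in = mask_modality(speech_in, text_in, 0.2, rng)
    print(speech_in, text_in)
```

In a training setup along these lines, the masked pair would be fed to the cross-modal encoder, which is then asked to predict the missing tokens or the missing modality; the prediction heads and the modal-level CTC loss are outside the scope of this sketch.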
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
Methods: Connectionist Temporal Classification Loss
