Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition

Tomohiro Tanaka; Ryo Masumura; Mana Ihori; Akihiko Takashima; Takafumi Moriya; Takanori Ashihara; Shota Orihashi; Naoki Makishima

arXiv:2107.01569·cs.CL·July 6, 2021

TL;DR

This paper introduces cross-modal transformer-based neural correction models that refine ASR output by jointly encoding the input speech and its ASR text with cross-modal self-attention, improving accuracy over conventional neural correction models that encode the two inputs separately.

Contribution

The paper presents cross-modal transformer models for neural correction that encode the speech and text inputs jointly rather than separately, so that the relationships between the two modalities can be taken into account.

Findings

01 Achieved better ASR performance than conventional neural correction models.

02 Demonstrated effectiveness on Japanese natural language ASR tasks.

03 Utilized cross-modal self-attention to improve error correction.

Abstract

We propose cross-modal transformer-based neural correction models that refine the output of an automatic speech recognition (ASR) system so as to exclude ASR errors. Generally, neural correction models are composed of encoder-decoder networks, which can directly model sequence-to-sequence mapping problems. The most successful method is to use both the input speech and its ASR output text as the input contexts for the encoder-decoder networks. However, the conventional method cannot take into account the relationships between these two modalities because the input contexts are encoded separately for each modality. To effectively leverage the correlated information between the two modalities, our proposed models encode the two contexts jointly on the basis of cross-modal self-attention using a transformer. We expect that cross-modal self-attention can effectively…
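The idea of encoding both contexts jointly can be illustrated with a minimal sketch. This is not the authors' architecture: the abstract does not specify how the two inputs are combined, so the sketch assumes the speech frames and text token embeddings are concatenated along the time axis and passed through one single-head self-attention layer. The function name `cross_modal_self_attention`, the dimensions, and the random weights are all hypothetical, and positional or modality embeddings are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_self_attention(speech, text, w_q, w_k, w_v):
    """Single-head self-attention over concatenated speech and text.

    ASSUMPTION: joint encoding is modeled here as concatenation along
    the time axis; the source abstract does not confirm this detail.
    speech: (T_s, d) acoustic frame features
    text:   (T_t, d) ASR hypothesis token embeddings
    """
    joint = np.concatenate([speech, text], axis=0)  # (T_s + T_t, d)
    q, k, v = joint @ w_q, joint @ w_k, joint @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Each position attends over both modalities at once.
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical toy inputs: 5 speech frames, 3 text tokens, width 8.
rng = np.random.default_rng(0)
d = 8
speech = rng.normal(size=(5, d))
text = rng.normal(size=(3, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = cross_modal_self_attention(speech, text, w_q, w_k, w_v)
```

In this sketch the block `weights[5:, :5]` holds the attention that text positions place on speech frames. A separately encoded baseline would have no such cross-modal block, because each encoder would see only its own modality.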

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing