Conversational Speech Recognition By Learning Conversation-level Characteristics

Kun Wei; Yike Zhang; Sining Sun; Lei Xie; Long Ma

arXiv:2202.07855 · cs.SD · February 18, 2022

TL;DR

This paper introduces an end-to-end conversational speech recognition model that learns conversation-level characteristics such as role preference and topical coherence, reducing character error rate on two Mandarin conversational ASR tasks.

Contribution

The paper attaches a latent variational module (LVM) to a conformer-based encoder-decoder ASR backbone to learn role preference and topical coherence, and adds a topic model that biases the decoder outputs toward words in the predicted topics.

Findings

01

Achieves up to a 12% relative character error rate (CER) reduction on two Mandarin conversational ASR tasks.

02

Demonstrates the effectiveness of conversation-level features in improving recognition accuracy.

03

Validates the model's ability to learn and utilize conversation context.
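
The abstract reports the 12% figure as a relative reduction, which is easy to confuse with an absolute drop in percentage points. The sketch below shows the distinction, assuming the standard definition of CER as character-level edit distance divided by reference length. The function names and the example strings are illustrative and are not taken from the paper.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction: (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical transcripts: one wrong character out of six.
example_cer = cer("今天天气很好", "今天天汽很好")

# Hypothetical CERs of 25% and 22%: an absolute drop of 3 points
# corresponds to a relative reduction of roughly 12%.
example_gain = relative_reduction(0.25, 0.22)
```

Under this definition, a system that moves from a hypothetical 25% CER to 22% has improved by 3 points absolute but by about 12% relative.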

Abstract

Conversational automatic speech recognition (ASR) is the task of recognizing conversational speech involving multiple speakers. Unlike sentence-level ASR, conversational ASR can naturally take advantage of specific characteristics of conversation, such as role preference and topical coherence. This paper proposes a conversational ASR model that explicitly learns conversation-level characteristics under the prevalent end-to-end neural framework. The highlights of the proposed model are twofold. First, a latent variational module (LVM) is attached to a conformer-based encoder-decoder ASR backbone to learn role preference and topical coherence. Second, a topic model is specifically adopted to bias the outputs of the decoder to words in the predicted topics. Experiments on two Mandarin conversational ASR tasks show that the proposed model achieves a maximum 12% relative character error rate…
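
The abstract does not say how the topic model biases the decoder, so the following is only a sketch of one common approach: adding a fixed bonus to the logits of tokens that belong to the predicted topic before normalizing. The names `bias_toward_topic`, `topic_words`, and `bonus`, along with the toy vocabulary, are assumptions made for illustration and should not be read as the authors' method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bias_toward_topic(logits, vocab, topic_words, bonus=1.0):
    """Add `bonus` to the logit of every token in `topic_words`,
    then renormalize into a probability distribution.

    This is an assumed biasing scheme, not the mechanism from the paper.
    """
    if len(logits) != len(vocab):
        raise ValueError("logits and vocab must have the same length")
    biased = [
        score + bonus if token in topic_words else score
        for score, token in zip(logits, vocab)
    ]
    return softmax(biased)


# Toy decoder step: three candidate tokens with equal scores.
vocab = ["天气", "股票", "你好"]
logits = [0.0, 0.0, 0.0]

# Suppose a topic model predicted a finance topic containing "股票".
probs = bias_toward_topic(logits, vocab, topic_words={"股票"}, bonus=1.0)
```

With equal starting scores, the token in the predicted topic ends up with the highest probability while the others stay tied, which is the qualitative effect the abstract describes.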

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing