Conversational Speech Recognition By Learning Conversation-level Characteristics
Kun Wei, Yike Zhang, Sining Sun, Lei Xie, Long Ma

TL;DR
This paper introduces an end-to-end conversational speech recognition model that explicitly learns conversation-level characteristics, such as role preference and topical coherence, and improves accuracy on two Mandarin conversational ASR tasks.
Contribution
It proposes a new model combining a latent variational module and a topic model to explicitly incorporate conversation-level characteristics into ASR.
Findings
Achieves up to a 12% relative character error rate (CER) reduction on two Mandarin conversational ASR tasks.
Demonstrates the effectiveness of conversation-level features in improving recognition accuracy.
Validates the model's ability to learn and utilize conversation context.
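To make the "relative CER reduction" figure concrete, the following minimal sketch computes CER as character-level edit distance divided by reference length, then the relative reduction between two systems. The function names and the numbers in the usage note are illustrative assumptions, not values or code from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, improved_cer: float) -> float:
    """Relative reduction of the improved system with respect to the baseline."""
    return (baseline_cer - improved_cer) / baseline_cer
```

As a purely hypothetical example, a baseline CER of 0.25 and an improved CER of 0.22 give `relative_reduction(0.25, 0.22)` of about 0.12, that is, a 12% relative reduction even though the absolute difference is only 3 points.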
Abstract
Conversational automatic speech recognition (ASR) is the task of recognizing conversational speech involving multiple speakers. Unlike sentence-level ASR, conversational ASR can naturally take advantage of specific characteristics of conversation, such as role preference and topical coherence. This paper proposes a conversational ASR model that explicitly learns conversation-level characteristics under the prevalent end-to-end neural framework. The highlights of the proposed model are twofold. First, a latent variational module (LVM) is attached to a Conformer-based encoder-decoder ASR backbone to learn role preference and topical coherence. Second, a topic model is adopted to bias the outputs of the decoder toward words in the predicted topics. Experiments on two Mandarin conversational ASR tasks show that the proposed model achieves a maximum 12% relative character error rate…
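The abstract does not say how the topic model biases the decoder, so the sketch below is only one plausible reading: a log-linear interpolation of the decoder's token distribution with a topic-mixture word distribution. The function `bias_with_topics`, the `weight` parameter, and all inputs are hypothetical and should not be taken as the authors' method.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def bias_with_topics(decoder_logits, topic_probs, topic_word_probs, weight=0.5):
    """Bias decoder log-probabilities toward words likely under the predicted topics.

    decoder_logits:   unnormalized scores, one per vocabulary token
    topic_probs:      p(k), predicted distribution over topics
    topic_word_probs: topic_word_probs[k][w] = p(w | k)
    weight:           assumed interpolation weight (0 disables the bias)
    """
    vocab_size = len(decoder_logits)
    # Topic-mixture word distribution: p_topic(w) = sum_k p(k) * p(w | k)
    mixture = [
        sum(pk * topic_word_probs[k][w] for k, pk in enumerate(topic_probs))
        for w in range(vocab_size)
    ]
    log_p_dec = log_softmax(decoder_logits)
    eps = 1e-10  # avoids log(0) for words no topic supports
    fused = [lp + weight * math.log(m + eps) for lp, m in zip(log_p_dec, mixture)]
    # Renormalize so the result is again a log-probability distribution
    return log_softmax(fused)
```

With uniform decoder scores, the fused distribution simply follows the topic mixture, so tokens favored by the dominant predicted topic rise to the top; with `weight=0` the decoder distribution is returned unchanged.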
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
