Conversation-oriented ASR with multi-look-ahead CBS architecture
Huaibo Zhao, Shinya Fujie, Tetsuji Ogawa, Jin Sakuma, Yusuke Kida, Tetsunori Kobayashi

TL;DR
This paper introduces a multi-look-ahead CBS architecture for streaming ASR that targets high accuracy with zero look-ahead latency. Two encoders run in parallel, so the recognizer can support more natural and timely conversational interaction.
Contribution
The paper proposes a multi-look-ahead CBS-based streaming ASR system with parallel encoders to achieve high accuracy without delay, advancing real-time conversational speech recognition.
Findings
Achieves high accuracy with zero look-ahead latency.
Uses parallel encoders for primary and auxiliary recognition.
Employs block processing for efficient real-time recognition.
Abstract
During conversations, humans are capable of inferring the intention of the speaker at any point of the speech to prepare the following action promptly. Such ability is also the key for conversational systems to achieve rhythmic and natural conversation. To perform this, the automatic speech recognition (ASR) used for transcribing the speech in real-time must achieve high accuracy without delay. In streaming ASR, high accuracy is assured by attending to look-ahead frames, which leads to delay increments. To tackle this trade-off issue, we propose a multiple latency streaming ASR to achieve high accuracy with zero look-ahead. The proposed system contains two encoders that operate in parallel, where a primary encoder generates accurate outputs utilizing look-ahead frames, and the auxiliary encoder recognizes the look-ahead portion of the primary encoder without look-ahead. The proposed…
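The two-encoder idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`primary_encode`, `auxiliary_encode`, `streaming_hypotheses`), the string labels standing in for encoder outputs, and the choice to simply concatenate the two outputs are all assumptions made for illustration, since the abstract is truncated before it describes how the outputs are combined. The sketch only shows the timing argument: the primary encoder covers frames that already have enough future context, the auxiliary encoder covers the remaining most recent frames, and together they cover every frame received so far.

```python
from typing import List


def primary_encode(frames: List[int], lookahead: int) -> List[str]:
    """Hypothetical stand-in for the primary encoder.

    It only emits outputs for frames that already have `lookahead`
    future frames available, so its output lags the input.
    """
    n_confirmed = max(0, len(frames) - lookahead)
    return [f"P{f}" for f in frames[:n_confirmed]]


def auxiliary_encode(frames: List[int], lookahead: int) -> List[str]:
    """Hypothetical stand-in for the auxiliary encoder.

    It covers the most recent frames (the primary encoder's look-ahead
    portion) without waiting for any future frames.
    """
    n_confirmed = max(0, len(frames) - lookahead)
    return [f"A{f}" for f in frames[n_confirmed:]]


def streaming_hypotheses(
    frames: List[int], block_size: int, lookahead: int
) -> List[List[str]]:
    """Process the input block by block and return one hypothesis per block.

    Assumption: the two encoder outputs are concatenated, so each
    hypothesis covers every frame received so far.
    """
    results = []
    for end in range(block_size, len(frames) + 1, block_size):
        received = frames[:end]
        hypothesis = primary_encode(received, lookahead) + auxiliary_encode(
            received, lookahead
        )
        results.append(hypothesis)
    return results


if __name__ == "__main__":
    # Six frames, blocks of two frames, two look-ahead frames.
    for step, hyp in enumerate(streaming_hypotheses(list(range(6)), 2, 2)):
        print(step, hyp)
```

In this toy setup the frames labeled `A` at one step are relabeled `P` at a later step, once the primary encoder has seen their future context. A real system would run neural encoders and a decoder rather than string labels.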
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
