ConVoiFilter: A case study of doing cocktail party speech recognition
Thai-Binh Nguyen, Alexander Waibel

TL;DR
This paper introduces ConVoiFilter, an end-to-end speech recognition model that combines speech enhancement and ASR, significantly reducing word error rate in noisy, crowded environments through joint fine-tuning.
Contribution
The paper presents a novel joint fine-tuning approach for integrated speech enhancement and recognition, improving accuracy over independently tuned components.
Findings
WER reduced from 80% to 26.4% with enhancement alone
Joint fine-tuning further reduces WER to 14.5%
Open-source pre-trained model available for research
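The findings above quote absolute WER values. As a rough illustration of what they mean in relative terms, the sketch below computes the relative WER reduction for each step. The helper name `relative_reduction` is hypothetical and not from the paper, and the calculation assumes the quoted figures are absolute WER percentages measured on the same test data.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after` (both in WER percent)."""
    if before <= 0:
        raise ValueError("baseline WER must be positive")
    return (before - after) / before


# WER figures quoted in the findings (percent).
baseline_wer = 80.0      # ASR alone on the noisy, crowded input
separate_wer = 26.4      # enhancement + ASR, tuned separately
joint_wer = 14.5         # enhancement + ASR, jointly fine-tuned

steps = {
    "enhancement vs. baseline": (baseline_wer, separate_wer),
    "joint vs. separate tuning": (separate_wer, joint_wer),
    "joint vs. baseline": (baseline_wer, joint_wer),
}

for name, (before, after) in steps.items():
    print(f"{name}: {relative_reduction(before, after):.1%} relative reduction")
```

Under these assumptions, joint fine-tuning removes a little under half of the errors that remain after separate tuning.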
Abstract
This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model combines a single-channel speech enhancement module (ConVoiFilter), which isolates the speaker's voice from background noise, with an ASR module. This approach decreases the ASR word error rate (WER) from 80% to 26.4%. Typically, these two components are tuned independently because their data requirements differ. However, speech enhancement can introduce artifacts that degrade ASR performance. A joint fine-tuning strategy reduces the WER further, from 26.4% with separate tuning to 14.5% with joint tuning. We openly share our pre-trained model to foster further research at hf.co/nguyenvulebinh/voice-filter.
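The abstract reports all results as WER. For readers unfamiliar with the metric, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions. The paper's own scoring setup, including any text normalization, is not described here and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences, not taken from the paper's data.
reference = "turn on the kitchen light"
hypothesis = "turn on kitchen lights"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference. This can happen when a recognizer transcribes interfering speakers as well as the target.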
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Indoor and Outdoor Localization Technologies
