Convoifilter: A case study of doing cocktail party speech recognition

Thai-Binh Nguyen; Alexander Waibel

arXiv:2308.11380·cs.SD·April 9, 2024



TL;DR

This paper introduces an end-to-end model that pairs a single-channel speech enhancement module (ConVoiFilter) with an ASR module to recognize a target speaker in noisy, crowded environments. Joint fine-tuning of the two modules lowers word error rate (WER) from 80% to 14.5%.

Contribution

The paper presents a novel joint fine-tuning approach for integrated speech enhancement and recognition, improving accuracy over independently tuned components.

Findings

01. WER reduced from 80% to 26.4% with enhancement alone
02. Joint fine-tuning further reduces WER to 14.5%
03. Open-source pre-trained model available for research

Abstract

This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model combines a single-channel speech enhancement module that isolates the speaker's voice from background noise (ConVoiFilter) with an ASR module. With this approach, the model decreases the ASR word error rate (WER) from 80% to 26.4%. Typically, these two components are tuned independently because their data requirements differ. However, speech enhancement can introduce artifacts that degrade ASR accuracy. With a joint fine-tuning strategy, the model reduces the WER from 26.4% under separate tuning to 14.5% under joint tuning. We openly share our pre-trained model to foster further research at hf.co/nguyenvulebinh/voice-filter.
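The abstract reports its results as word error rate, so a minimal sketch of how WER is conventionally computed may help in reading the numbers. This is the standard word-level edit distance divided by the reference length, not code from the paper; the function names `word_error_rate` and `relative_reduction` are hypothetical and chosen only for illustration, and the paper's own text normalization and scoring setup are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no further
    normalization (casing, punctuation), which real evaluations often add.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed, e.g. 0.5 means halved."""
    return (before - after) / before


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Relative gains implied by the WER figures quoted in the abstract.
    print(relative_reduction(80.0, 26.4))  # enhancement with separate tuning
    print(relative_reduction(26.4, 14.5))  # joint tuning over separate tuning
```

Applied to the figures in the abstract, separate tuning removes about 67% of the baseline errors, and joint fine-tuning removes roughly a further 45% of what remains. Note that WER can exceed 100% when the hypothesis contains many inserted words, which is common when overlapping speakers are transcribed without enhancement.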


Code & Models

Models

🤗
nguyenvulebinh/voice-filter
model · 3.4k downloads · ♡ 4


Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Indoor and Outdoor Localization Technologies