Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network
Midia Yousefi, John H.L. Hansen

TL;DR
This paper introduces a real-time, attention-guided CNN system for estimating the number of active speakers in overlapping speech scenarios, addressing the challenge of unknown speaker count in real-world applications.
Contribution
It proposes a novel attention-guided CNN architecture that improves speaker counting accuracy in overlapping speech, especially for short segments, compared to traditional methods.
Findings
Attention mechanism improves performance by nearly 3% over average pooling.
Achieves over 92% accuracy in offline scenarios with longer input signals.
Maintains high precision and recall on short speech segments.
Abstract
Most current speech technology systems are designed to operate well even in the presence of multiple active speakers. However, most solutions assume that the number of concurrent speakers is known. Unfortunately, this information might not always be available in real-world applications. In this study, we propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech. The proposed system extracts higher-level information from the speech spectral content using a CNN model. Next, the attention mechanism summarizes the extracted information into a compact feature vector without losing critical information. Finally, the number of active speakers is classified using a fully connected network. Experiments on simulated overlapping speech using the WSJ corpus show that the attention solution improves the…
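The pooling step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' exact formulation, so everything below is an assumption made for illustration: the frame features stand in for CNN outputs, attention is a single dot-product scoring vector followed by a softmax, and the classifier is one linear layer. The function names (`average_pool`, `attention_pool`, `classify`) and all parameters are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_pool(frames):
    """Baseline: every time frame contributes equally to the summary vector."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / num_frames for d in range(dim)]


def attention_pool(frames, score_vector):
    """Weighted summary of frames (assumed form, not the paper's exact one).

    Each frame gets a scalar score (dot product with score_vector); a softmax
    over time turns the scores into weights that sum to 1.
    """
    scores = [sum(w * x for w, x in zip(score_vector, f)) for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


def classify(pooled, class_weights, class_biases):
    """Hypothetical single linear layer mapping the pooled vector to a class index."""
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(class_weights, class_biases)
    ]
    probs = softmax(logits)
    return probs.index(max(probs)), probs
```

With an all-zero `score_vector`, every frame receives the same weight and `attention_pool` reduces to `average_pool`. A learned scoring vector lets informative frames dominate the summary, which is one plausible reading of how attention could outperform plain averaging.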
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
