Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion
Kumud Tripathi, Chowdam Venkata Kumar, Pankaj Wasnik

TL;DR
This paper shows that simple fusion of traditional MFCC features with pre-trained model (PTM) features improves voice activity detection accuracy and robustness over single-feature models, exceeding the state-of-the-art Pyannote by an absolute average of 2.04% across multiple datasets.
Contribution
The study introduces FusionVAD, a unified framework that combines MFCC and PTM features through three fusion strategies: concatenation, addition, and cross-attention (CA).
Findings
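Simple fusion methods, particularly addition, outperform cross-attention in both accuracy and efficiency.
Fusion models consistently surpass single-feature models, indicating that MFCC and PTM features are complementary.
The best fusion model exceeds the state-of-the-art Pyannote by an absolute average of 2.04% across multiple datasets.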
Abstract
Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results…
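The three fusion strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that both feature streams have already been aligned to the same number of frames and, for addition and cross-attention, projected to a common dimension; the abstract does not say how FusionVAD handles either step. The function names are hypothetical, and the cross-attention shown here is plain single-head scaled dot-product attention with no learned projections, with the choice of MFCC as query made only for illustration.

```python
import math

# Each stream is a list of frames; each frame is a list of floats.
# Assumption: frames are time-aligned, one MFCC frame per PTM frame.

def fuse_concat(mfcc, ptm):
    """Frame-wise concatenation: output dimension is d_mfcc + d_ptm."""
    return [m + p for m, p in zip(mfcc, ptm)]

def fuse_add(mfcc, ptm):
    """Element-wise addition: both streams must share one dimension."""
    return [[a + b for a, b in zip(m, p)] for m, p in zip(mfcc, ptm)]

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_cross_attention(query, context):
    """Simplified cross-attention: each query frame attends over all
    context frames and returns a weighted average of them.
    No learned Q/K/V projections are applied (illustrative only)."""
    d = len(query[0])
    fused = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        fused.append([sum(w * k[j] for w, k in zip(weights, context))
                      for j in range(d)])
    return fused

# Toy input: 2 frames, 2 dimensions per stream (made-up values).
mfcc = [[1.0, 0.0], [0.0, 1.0]]
ptm = [[0.5, 0.5], [1.0, -1.0]]

concat_out = fuse_concat(mfcc, ptm)        # 2 frames x 4 dims
add_out = fuse_add(mfcc, ptm)              # 2 frames x 2 dims
ca_out = fuse_cross_attention(mfcc, ptm)   # 2 frames x 2 dims
```

The sketch makes the cost difference visible: concatenation and addition touch each frame once, while cross-attention scores every query frame against every context frame, so its work grows with the square of the sequence length.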
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Speech and dialogue systems
