TL;DR
This paper introduces a hybrid CNN-BiLSTM architecture for voice activity detection that emphasizes computational efficiency and robustness in noisy environments, outperforming existing baselines on the AVA-Speech dataset.
Contribution
The paper proposes a novel hybrid CNN-BiLSTM VAD model optimized for resource-constrained settings, demonstrating improved accuracy and efficiency over existing methods.
Findings
BiLSTM layers improve accuracy by approximately 2% absolute over unidirectional LSTM layers.
Significantly smaller models with near-optimal parameters perform on par with larger models.
The proposed system achieves an AUC of 0.951, outperforming baselines in noisy conditions.
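As a reminder of what the AUC figure measures, here is a minimal sketch of the area under the ROC curve for frame-level speech/non-speech scores. It uses the pairwise-ranking identity (the probability that a randomly chosen speech frame scores higher than a randomly chosen non-speech frame). The function name `roc_auc` and the toy data are illustrative assumptions, not the paper's evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as half).

    labels: 1 for speech frames, 0 for non-speech frames (assumed encoding).
    scores: model outputs, higher meaning "more likely speech".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one speech and one non-speech frame")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # speech frame ranked above non-speech frame
            elif p == n:
                wins += 0.5      # tie contributes half a win
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 speech/non-speech pairs are ranked correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The quadratic loop is fine for a small illustration; a sort-based rank computation would be the usual choice for full-length recordings.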
Abstract
This paper presents a new hybrid architecture for voice activity detection (VAD) incorporating both convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) layers trained in an end-to-end manner. In addition, we focus specifically on optimising the computational efficiency of our architecture in order to deliver robust performance in difficult in-the-wild noise conditions in a severely under-resourced setting. Nested k-fold cross-validation was used to explore the hyperparameter space, and the trade-off between optimal parameters and model size is discussed. The performance effect of a BiLSTM layer compared to a unidirectional LSTM layer was also considered. We compare our systems with three established baselines on the AVA-Speech dataset. We find that significantly smaller models with near optimal parameters perform on par with larger models trained with…
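The nested k-fold procedure mentioned in the abstract can be sketched roughly as follows: an inner loop selects hyperparameters using only the outer training data, and an outer loop scores the selected configuration on data that played no part in the selection. This is a generic illustration under stated assumptions, not the authors' setup: the fold counts, the plain shuffled split, and the `fit_and_score` callback are all hypothetical.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle indices 0..n_items-1 and split them into k near-equal folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv(n_items, param_grid, fit_and_score, outer_k=5, inner_k=3, seed=0):
    """Return one (best_params, outer_test_score) pair per outer fold.

    fit_and_score(train_idx, test_idx, params) -> float is a hypothetical
    callback that trains on train_idx and returns a score on test_idx.
    """
    results = []
    outer_folds = k_fold_indices(n_items, outer_k, seed)
    for o, test_idx in enumerate(outer_folds):
        train_idx = [i for j, f in enumerate(outer_folds) if j != o for i in f]
        # Inner folds are built from the outer training data only.
        inner_folds = [train_idx[i::inner_k] for i in range(inner_k)]

        def inner_score(params):
            scores = []
            for v, val_idx in enumerate(inner_folds):
                fit_idx = [i for j, f in enumerate(inner_folds) if j != v for i in f]
                scores.append(fit_and_score(fit_idx, val_idx, params))
            return sum(scores) / len(scores)

        # Pick the configuration with the best mean inner-validation score.
        best = max(param_grid, key=inner_score)
        # Score it on the held-out outer fold, unseen during selection.
        results.append((best, fit_and_score(train_idx, test_idx, best)))
    return results
```

Because the outer test fold never influences hyperparameter selection, the outer scores give a less optimistic estimate than tuning and testing on the same folds.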
Taxonomy
Methods: Average Pooling · Batch Normalization · Global Average Pooling · Tanh Activation · Residual Connection · 1x1 Convolution · Sigmoid Activation · Convolution · Bottleneck Residual Block
