MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection
Fei Jia, Somshubra Majumdar, Boris Ginsburg

TL;DR
MarbleNet is a deep residual network built from 1D time-channel separable convolutions. It matches the voice activity detection (VAD) performance of a state-of-the-art model with roughly one tenth of the parameters, and the authors run ablation studies to examine its robustness on real-world VAD tasks.
Contribution
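The paper introduces MarbleNet, an end-to-end deep residual network for VAD built from blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout. It reaches performance similar to a state-of-the-art VAD model at roughly one tenth of the parameter cost.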
Findings
Achieves performance similar to a state-of-the-art VAD model
Uses roughly 1/10 of the parameters of that state-of-the-art model
Examines robustness on real-world VAD tasks through ablation studies over training methods and parameter choices
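To see why separable convolutions are cheap, it can help to count weights for a single layer. The sketch below compares a regular 1D convolution with a depthwise-plus-pointwise pair. The layer sizes (64 channels, kernel length 13) and the function names are illustrative assumptions, not values from the paper, and the resulting ratio is a per-layer figure that should not be read as the paper's model-level 1/10 comparison.

```python
def standard_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a regular 1D convolution (bias omitted)."""
    return c_in * c_out * kernel


def separable_conv1d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depthwise (over time) plus pointwise (over channels) pair."""
    depthwise = c_in * kernel   # one length-`kernel` filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer sizes only (assumed, not taken from the paper).
c_in, c_out, kernel = 64, 64, 13
standard = standard_conv1d_params(c_in, c_out, kernel)
separable = separable_conv1d_params(c_in, c_out, kernel)
ratio = standard / separable

print(f"standard: {standard}, separable: {separable}, ratio: {ratio:.1f}x")
```

The saving grows with kernel length, because the kernel multiplies only the small depthwise term rather than the full channel-by-channel product.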
Abstract
We present MarbleNet, an end-to-end neural network for Voice Activity Detection (VAD). MarbleNet is a deep residual network composed of blocks of 1D time-channel separable convolution, batch normalization, ReLU, and dropout layers. When compared to a state-of-the-art VAD model, MarbleNet is able to achieve similar performance with roughly 1/10th the parameter cost. We further conduct extensive ablation studies on different training methods and choices of parameters in order to study the robustness of MarbleNet in real-world VAD tasks.
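The core operation named in the abstract can be sketched without any framework. The following is a minimal, dependency-free illustration of a time-channel separable convolution followed by ReLU: a depthwise pass filters each channel along time, then a pointwise pass mixes channels at each time step. All function names, the zero "same" padding, and the toy weights are assumptions made for this example. Batch normalization, dropout, and the residual connection are omitted for brevity, so this is not the authors' implementation.

```python
from typing import List

Signal = List[List[float]]  # indexed as [channel][time]


def depthwise_conv1d(x: Signal, kernels: List[List[float]]) -> Signal:
    """Filter each channel along time with its own odd-length kernel.

    Uses zero "same" padding and cross-correlation (no kernel flip),
    which is the convention in deep learning frameworks.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + channel + [0.0] * pad
        out.append([
            sum(k * padded[t + i] for i, k in enumerate(kernel))
            for t in range(len(channel))
        ])
    return out


def pointwise_conv1d(x: Signal, weights: List[List[float]]) -> Signal:
    """Mix channels at each time step; weights is indexed [c_out][c_in]."""
    steps = len(x[0])
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) for t in range(steps)]
        for row in weights
    ]


def relu(x: Signal) -> Signal:
    return [[max(0.0, v) for v in channel] for channel in x]


def separable_block(x: Signal, kernels: List[List[float]],
                    weights: List[List[float]]) -> Signal:
    """Depthwise over time, pointwise over channels, then ReLU."""
    return relu(pointwise_conv1d(depthwise_conv1d(x, kernels), weights))


# Toy input: 2 channels, 3 time steps (values chosen only for illustration).
x = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.0]]
kernels = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]  # identity; sum of neighbours
weights = [[1.0, 1.0], [0.0, -1.0]]           # 2 output channels
result = separable_block(x, kernels, weights)
print(result)
```

In a framework such as PyTorch, the depthwise step is usually expressed as a grouped convolution with one group per channel and the pointwise step as a convolution with kernel size 1.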
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
