LibriVAD: A Scalable Open Dataset with Deep Learning Benchmarks for Voice Activity Detection

Ioannis Stylianou; Achintya kr. Sarkar; Nauman Dawalatabad; James Glass; Zheng-Hua Tan

arXiv:2512.17281 · cs.SD · December 22, 2025

TL;DR

This paper introduces LibriVAD, a large-scale, open-source dataset for voice activity detection, and benchmarks deep learning models including a Vision Transformer, demonstrating improved performance and generalization under diverse noisy conditions.

Contribution

The paper presents LibriVAD, a scalable dataset with controlled noise conditions, and evaluates novel deep learning architectures such as the Vision Transformer (ViT), advancing voice activity detection (VAD) research and benchmarking standards.

Findings

01. A Vision Transformer (ViT) with MFCC features outperforms traditional models.

02. Scaling dataset size improves out-of-distribution (OOD) generalization.

03. Balancing the silence-to-speech ratio (SSR) enhances model robustness.

Abstract

Robust Voice Activity Detection (VAD) remains a challenging task, especially under noisy, diverse, and unseen acoustic conditions. Beyond algorithmic development, a key limitation in advancing VAD research is the lack of large-scale, systematically controlled, and publicly available datasets. To address this, we introduce LibriVAD - a scalable open-source dataset derived from LibriSpeech and augmented with diverse real-world and synthetic noise sources. LibriVAD enables systematic control over speech-to-noise ratio, silence-to-speech ratio (SSR), and noise diversity, and is released in three sizes (15 GB, 150 GB, and 1.5 TB) with two variants (LibriVAD-NonConcat and LibriVAD-Concat) to support different experimental setups. We benchmark multiple feature-model combinations, including waveform, Mel-Frequency Cepstral Coefficients (MFCC), and Gammatone filter bank cepstral coefficients,…
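The two quantities the abstract says LibriVAD controls can be illustrated with a minimal sketch. This is not the authors' data-generation pipeline: the function names `mix_at_snr` and `silence_to_speech_ratio` are hypothetical, the SNR is computed over the whole utterance rather than over speech-active regions only, and SSR is assumed here to be the count of non-speech frames divided by the count of speech frames. The paper may define either quantity differently.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio is snr_db.

    Assumption: power is averaged over the full signal, not only over
    speech-active regions.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat or trim the noise so it matches the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Choose the gain so that speech_power / (gain^2 * noise_power)
    # equals 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise


def silence_to_speech_ratio(frame_labels):
    """Ratio of non-speech frames to speech frames (assumed SSR definition).

    frame_labels: sequence of 0/1 values, 1 meaning the frame contains speech.
    """
    labels = np.asarray(frame_labels, dtype=bool)
    n_speech = int(labels.sum())
    if n_speech == 0:
        raise ValueError("no speech frames; SSR is undefined")
    n_silence = labels.size - n_speech
    return n_silence / n_speech
```

With helpers like these, one could sweep `snr_db` over a grid of values and group utterances by SSR to study how each factor affects a VAD model, which is the kind of controlled comparison the abstract describes.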

Code & Models

Datasets

LibriVAD/LibriVAD
dataset · 119 downloads

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation