TL;DR
Spatial LibriSpeech is a spatial audio dataset of over 650 hours, built by augmenting LibriSpeech with simulated acoustic conditions, for training machine learning models on spatial audio tasks such as source localization and room acoustics estimation. Models trained on it generalize well to other evaluation datasets.
Contribution
We introduce Spatial LibriSpeech, a dataset for spatial audio learning with over 650 hours of 19-channel audio, first-order ambisonics, optional distractor noise, and labels for source position, speaking direction, room acoustics and geometry.
Findings
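Models trained on our dataset achieve a median absolute error of 6.60° on 3D source localization, 0.43 m on distance, and 2.74 dB on DRR estimation.
The same models generalize to widely used evaluation datasets, e.g., a median absolute error of 12.43° on 3D source localization on TUT Sound Events 2018.
Room acoustics estimation also transfers across datasets: the median absolute error on T30 is 90.66 ms on Spatial LibriSpeech and 157.32 ms on an external evaluation dataset.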
Abstract
We present Spatial LibriSpeech, a spatial audio dataset with over 650 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics and geometry. Spatial LibriSpeech is generated by augmenting LibriSpeech samples with 200k+ simulated acoustic conditions across 8k+ synthetic rooms. To demonstrate the utility of our dataset, we train models on four spatial audio tasks, resulting in a median absolute error of 6.60° on 3D source localization, 0.43 m on distance, 90.66 ms on T30, and 2.74 dB on DRR estimation. We show that the same models generalize well to widely-used evaluation datasets, e.g., obtaining a median absolute error of 12.43° on 3D source localization on TUT Sound Events 2018, and 157.32 ms on T30 estimation…
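The 3D source localization figures above are median absolute errors in degrees. The excerpt does not define the metric, so the following is only a minimal sketch of one common way to compute such a number: it assumes the error is the great-circle angle between the true and predicted directions, and that each direction is given as (azimuth, elevation) in degrees. The function names `angular_error_deg` and `median_angular_error_deg` are hypothetical and are not taken from the paper or its code.

```python
import math
from statistics import median


def angular_error_deg(az_true, el_true, az_pred, el_pred):
    """Angle in degrees between two directions given as azimuth/elevation in degrees.

    Assumption: azimuth is measured in the horizontal plane and elevation
    upward from it; the paper's own convention may differ.
    """
    def to_unit(az, el):
        az, el = math.radians(az), math.radians(el)
        return (
            math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el),
        )

    u = to_unit(az_true, el_true)
    v = to_unit(az_pred, el_pred)
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    dot = max(-1.0, min(1.0, dot))
    return math.degrees(math.acos(dot))


def median_angular_error_deg(true_dirs, pred_dirs):
    """Median of per-sample angular errors over paired (azimuth, elevation) tuples."""
    errors = [angular_error_deg(*t, *p) for t, p in zip(true_dirs, pred_dirs)]
    return median(errors)
```

Using the median rather than the mean keeps a handful of badly mislocalized samples (for example, front-back confusions) from dominating the reported figure.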
