Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning

Miguel Sarabia; Elena Menyaylenko; Alessandro Toso; Skyler Seto; Zakaria Aldeneh; Shadi Pirhosseinloo; Luca Zappella; Barry-John Theobald; Nicholas Apostoloff; Jonathan Sheaffer

arXiv:2308.09514·cs.SD·August 21, 2023

TL;DR

Spatial LibriSpeech is a large, augmented spatial audio dataset designed to facilitate machine learning for spatial audio tasks, including source localization and room acoustics estimation, with models demonstrating strong generalization across datasets.

Contribution

We introduce Spatial LibriSpeech, a comprehensive augmented dataset for spatial audio learning, enabling improved model training and evaluation for various spatial audio tasks.

Findings

01

Models trained on our dataset achieve a median absolute error of 6.60° on 3D source localization.

02

Models trained on the dataset generalize to other evaluation datasets, e.g., a median absolute error of 12.43° on 3D source localization on TUT Sound Events 2018.

03

Our models also accurately estimate room acoustics parameters across different datasets.

Abstract

We present Spatial LibriSpeech, a spatial audio dataset with over 650 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics and geometry. Spatial LibriSpeech is generated by augmenting LibriSpeech samples with 200k+ simulated acoustic conditions across 8k+ synthetic rooms. To demonstrate the utility of our dataset, we train models on four spatial audio tasks, resulting in a median absolute error of 6.60° on 3D source localization, 0.43 m on distance, 90.66 ms on T30, and 2.74 dB on DRR estimation. We show that the same models generalize well to widely-used evaluation datasets, e.g., obtaining a median absolute error of 12.43° on 3D source localization on TUT Sound Events 2018, and 157.32 ms on T30 estimation…
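
The localization figures above are median absolute errors in degrees. As a rough illustration of how such a metric can be computed, the sketch below measures the angle between predicted and ground-truth 3D direction vectors and takes the median over samples. The function names and the Cartesian vector representation are assumptions made for this example; they are not taken from the paper or its repository, and the authors' exact evaluation procedure may differ.

```python
import math
from statistics import median


def angular_error_deg(pred, true):
    """Angle in degrees between two non-zero 3D direction vectors.

    Hypothetical helper for illustration; vectors need not be unit length.
    """
    dot = sum(p * t for p, t in zip(pred, true))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_true = math.sqrt(sum(t * t for t in true))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / (norm_pred * norm_true)))
    return math.degrees(math.acos(cos_angle))


def median_angular_error_deg(preds, trues):
    """Median of the per-sample angular errors, in degrees."""
    return median(angular_error_deg(p, t) for p, t in zip(preds, trues))


if __name__ == "__main__":
    # Toy data: three predictions compared against the same true direction.
    predictions = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
    targets = [(1.0, 0.0, 0.0)] * 3
    print(median_angular_error_deg(predictions, targets))
```

The median is less sensitive than the mean to a few badly mislocalized samples, which is one reason it is a common summary statistic for localization error.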

Code & Models

Repositories

apple/ml-spatial-librispeech
PyTorch (official)
