Evaluation of Deep Audio Representations for Hearables
Fabian Gröger, Pascal Baumann, Ludovic Amruthalingam, Laurent Simon, Ruksana Giurda, Simone Lionetti

TL;DR
This paper introduces DEAR, a new dataset and benchmark for evaluating foundation models' ability to capture acoustic properties relevant to hearable devices, demonstrating the superiority of the BEATs model in this context.
Contribution
The paper presents DEAR, the first dataset and benchmark specifically designed to evaluate foundation models for acoustic scene understanding in hearables, and shows BEATs' leading performance.
Findings
BEATs significantly outperforms other models on the benchmark
Diverse training data enhances model applicability to auditory tasks
DEAR enables systematic evaluation of audio representations for hearables
Abstract
Effectively steering hearable devices requires understanding the acoustic environment around the user. In the computational analysis of sound scenes, foundation models have emerged as the state of the art to produce high-performance, robust, multi-purpose audio representations. We introduce and release Deep Evaluation of Audio Representations (DEAR), the first dataset and benchmark to evaluate the efficacy of foundation models in capturing essential acoustic properties for hearables. The dataset includes 1,158 audio tracks, each 30 seconds long, created by spatially mixing proprietary monologues with commercial, high-quality recordings of everyday acoustic scenes. Our benchmark encompasses eight tasks that assess the general context, speech sources, and technical acoustic properties of the audio scenes. Through our evaluation of four general-purpose audio representation models, we…
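As a quick sanity check on the dataset size quoted in the abstract, the short sketch below works out the total amount of audio implied by 1,158 tracks of 30 seconds each. The constant and function names are illustrative assumptions and do not come from the DEAR release.

```python
# Figures quoted in the abstract; the names below are hypothetical.
NUM_TRACKS = 1158      # number of audio tracks in the dataset
TRACK_SECONDS = 30     # length of each track in seconds


def total_duration_hours(num_tracks: int, track_seconds: int) -> float:
    """Return the total audio duration in hours."""
    return num_tracks * track_seconds / 3600


total_seconds = NUM_TRACKS * TRACK_SECONDS
print(total_seconds)                                     # 34740
print(total_duration_hours(NUM_TRACKS, TRACK_SECONDS))   # 9.65
```

Under these figures the dataset amounts to a little under ten hours of audio in total.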
Taxonomy
Topics: Speech and Audio Processing
