RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification

Jacob Bitterman; Daniel Levi; Hilel Hagai Diamandi; Sharon Gannot; Tal Rosenwein

arXiv:2406.03120·eess.AS·June 6, 2024

Open Access

TL;DR

This paper introduces a dual-encoder contrastive learning approach to extract room and reverberant speech embeddings for room shape classification, enabling room fingerprinting directly from speech signals.

Contribution

It proposes a novel joint embedding framework using contrastive learning to estimate room parameters from speech without needing explicit RIR measurements.

Findings

1. Effective in simulated environments for room shape classification

2. Outperforms baseline methods in embedding quality

3. Enables room fingerprinting directly from speech signals

Abstract

This paper focuses on room fingerprinting, a task involving the analysis of an audio recording to determine the specific volume and shape of the room in which it was captured. While it is relatively straightforward to determine the basic room parameters from the Room Impulse Responses (RIR), doing so from a speech signal is a cumbersome task. To address this challenge, we introduce a dual-encoder architecture that facilitates the estimation of room parameters directly from speech utterances. During pre-training, one encoder receives the RIR while the other processes the reverberant speech signal. A contrastive loss function is employed to embed the speech and the acoustic response jointly. In the fine-tuning stage, the specific classification task is trained. In the test phase, only the reverberant utterance is available, and its embedding is used for the task of room shape…
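The pre-training step described in the abstract, where paired reverberant-speech and RIR embeddings are pulled together, can be illustrated with a small sketch. The abstract does not state the exact form of the contrastive loss, so the CLIP-style symmetric cross-entropy below is an assumption, as are the temperature value and every function name. Plain Python is used only to keep the example self-contained; the encoders themselves are omitted and their outputs are represented by short lists of floats.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(speech_emb, rir_emb, temperature=0.1):
    """Hypothetical CLIP-style loss; speech_emb[i] and rir_emb[i] share a room.

    The temperature of 0.1 is an illustrative assumption, not a value
    taken from the paper.
    """
    speech = [l2_normalize(v) for v in speech_emb]
    rir = [l2_normalize(v) for v in rir_emb]
    n = len(speech)

    # logits[i][j]: scaled cosine similarity between utterance i and RIR j.
    logits = [
        [sum(a * b for a, b in zip(speech[i], rir[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Speech-to-RIR direction: each row should peak on its diagonal entry.
    loss_s2r = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # RIR-to-speech direction: same idea applied to the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_r2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_s2r + loss_r2s)


# Toy embeddings: two rooms, with speech and RIR vectors roughly aligned per room.
speech_batch = [[1.0, 0.0], [0.0, 1.0]]
rir_batch = [[0.9, 0.1], [0.1, 0.9]]

aligned = symmetric_contrastive_loss(speech_batch, rir_batch)
swapped = symmetric_contrastive_loss(speech_batch, rir_batch[::-1])
print(aligned < swapped)
```

With correctly paired batches the loss is close to zero, while swapping the RIR order makes every pair a mismatch and the loss grows. In a setup like the one the abstract outlines, only the speech encoder would be needed at test time, since the RIR is not available then.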

Taxonomy

Topics: Speech and Audio Processing