TL;DR
This paper introduces TS-RIRGAN, an architecture that translates synthetic room impulse responses (RIRs) into more realistic ones. Far-field speech augmented with the translated RIRs yields lower word error rates in speech recognition.
Contribution
The paper presents TS-RIRGAN, a new architecture that translates synthetic RIRs into more realistic ones, narrowing the fidelity gap between synthetic and real RIRs used for speech augmentation.
Findings
Improved synthetic RIRs lead to up to 19.9% WER reduction.
Translation with TS-RIRGAN enhances RIR realism.
Method benefits far-field speech recognition performance.
Abstract
We present a method for improving the quality of synthetic room impulse responses for far-field speech recognition. We bridge the gap between the fidelity of synthetic room impulse responses (RIRs) and that of real room impulse responses using our novel TS-RIRGAN architecture. Given a synthetic RIR in the form of raw audio, we use TS-RIRGAN to translate it into a real RIR. We also perform real-world sub-band room equalization on the translated synthetic RIR. Our overall approach improves the quality of synthetic RIRs by compensating for low-frequency wave effects, similar to those in real RIRs. We evaluate the performance of improved synthetic RIRs on a far-field speech dataset augmented by convolving the LibriSpeech clean speech dataset [1] with RIRs and adding background noise. We show that far-field speech augmented using our improved synthetic RIRs reduces the word error rate by up to…
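The augmentation step described in the abstract (convolving clean speech with an RIR, then adding background noise) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `augment_far_field`, the `snr_db` parameter, the trimming of the reverberant tail, and the noise tiling are all assumptions made for the example, and the signals are random placeholders rather than LibriSpeech audio or real RIRs.

```python
import numpy as np


def augment_far_field(clean, rir, noise, snr_db):
    """Reverberate clean speech with an RIR, then add noise at a target SNR.

    Hypothetical helper for illustration; all inputs are 1-D float arrays
    sampled at the same rate.
    """
    # Convolution with the RIR simulates the room's effect on the signal.
    # Trimming to the input length is a simplification that drops the tail.
    reverberant = np.convolve(clean, rir)[: len(clean)]

    # Repeat or cut the noise so it covers the whole utterance.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]

    # Scale the noise so that speech power / noise power matches snr_db.
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    eps = 1e-12  # guards against division by zero for silent noise
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10) + eps))
    return reverberant + gain * noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)  # placeholder for 1 s of speech at 16 kHz
    # Placeholder RIR: exponentially decaying noise, not a measured response.
    rir = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800.0)
    noise = rng.standard_normal(8000)   # placeholder background noise
    far_field = augment_far_field(clean, rir, noise, snr_db=15.0)
    print(far_field.shape)
```

In this sketch the quality of the result depends entirely on the `rir` argument, which is the quantity the paper aims to improve; the rest of the augmentation is unchanged whether the RIR is synthetic, translated, or real.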
