RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios

Yiwen Shao; Shi-Xiong Zhang; Dong Yu

arXiv:2311.00146 · eess.AS · June 13, 2024 · 1 cites

TL;DR

This paper introduces RIR-SF, a novel spatial feature based on room impulse response that improves multi-channel multi-speaker speech recognition by effectively modeling reverberation and reflections, outperforming traditional spatial features.

Contribution

The paper presents RIR-SF, a new RIR-based spatial feature, and an all-neural ASR framework that together significantly enhance recognition accuracy in reverberant multi-talker environments.

Findings

01 RIR-SF outperforms traditional 3D spatial features in reverberant conditions.

02 The proposed ASR framework achieves a 21.3% relative CER reduction.

03 RIR-SF demonstrates robustness in high-reverberation scenarios.

Abstract

Automatic speech recognition (ASR) on multi-talker recordings is challenging. Current methods using 3D spatial data from multi-channel audio and visual cues focus mainly on direct waves from the target speaker, overlooking reflection wave impacts, which hinders performance in reverberant environments. Our research introduces RIR-SF, a novel spatial feature based on room impulse response (RIR) that leverages the speaker's position, room acoustics, and reflection dynamics. RIR-SF significantly outperforms traditional 3D spatial features, showing superior theoretical and empirical performance. We also propose an optimized all-neural multi-channel ASR framework for RIR-SF, achieving a relative 21.3% reduction in CER for target speaker ASR in multi-channel settings. RIR-SF enhances recognition accuracy and demonstrates robustness in high-reverberation scenarios, overcoming the limitations…
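
For readers unfamiliar with the metric, a relative CER reduction is conventionally computed as the drop in character error rate divided by the baseline rate. The sketch below illustrates that standard definition only; the function name `relative_reduction` and both CER values are hypothetical and are not taken from the paper, which does not state its baseline in this listing.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the relative reduction of new_cer with respect to baseline_cer (as a fraction)."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical CER values (in percent), chosen only to illustrate the arithmetic;
# they are NOT results reported in the paper.
hypothetical_baseline_cer = 10.0
hypothetical_new_cer = 7.87

print(f"{relative_reduction(hypothetical_baseline_cer, hypothetical_new_cer):.1%}")
```

Note that a relative reduction is not an absolute one: in this illustration the absolute drop is 2.13 CER points, while the relative drop is about a fifth of the baseline.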

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Indoor and Outdoor Localization Technologies

Methods: Focus · Convolution