SpatialEmb: Extract and Encode Spatial Information for 1-Stage Multi-channel Multi-speaker ASR on Arbitrary Microphone Arrays
Yiwen Shao, Yong Xu, Sanjeev Khudanpur, Dong Yu

TL;DR
This paper introduces SpatialEmb, a lightweight module that extracts and encodes spatial information directly for a multi-channel ASR model. It improves efficiency and robustness across arbitrary microphone arrays and achieves state-of-the-art results on AliMeeting.
Contribution
SpatialEmb enables direct spatial information encoding for multi-channel ASR, reducing pipeline complexity and improving adaptability to different microphone topologies.
Findings
Achieves a character error rate (CER) of 17.04% on the AliMeeting Eval set and 20.32% on the Test set.
Supports both fixed and arbitrary microphone topologies.
Establishes new state-of-the-art results on AliMeeting.
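For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between hypothesis and reference transcripts, summed over the corpus and divided by the total number of reference characters. The following is a minimal sketch of that standard definition, not the authors' scoring script; the function names `edit_distance` and `cer` are illustrative, and any text normalization applied before scoring is not covered here.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_chars
```

Under this definition, a CER of 17.04% corresponds to roughly 17 character edits per 100 reference characters.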
Abstract
Spatial information is a critical clue for multi-channel multi-speaker target speech recognition. Most state-of-the-art multi-channel Automatic Speech Recognition (ASR) systems extract spatial features only during the speech separation stage, followed by standard single-channel ASR on the separated speech. This approach results in an inefficient, lengthy pipeline and sub-optimal ASR performance due to the accumulated errors from preprocessing modules. Furthermore, most spatial feature extraction methods depend on the knowledge of speaker positions and microphone topology, making the systems reliant on specific settings and challenging to adapt to new equipment. In this work, we propose a solution to these issues with a lightweight embedding module named SpatialEmb, which extracts and encodes spatial information directly for the ASR model, supporting both fixed and arbitrary microphone…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
