Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform
Xiangzhu Kong, Huang Hao, Zhijian Ou

TL;DR
This paper introduces SHTNet, a lightweight, geometry-invariant multi-channel speech recognition framework that improves cross-array robustness and reduces computational complexity using spherical harmonic transforms and novel attention mechanisms.
Contribution
The paper proposes SHTNet, a novel framework combining spherical harmonic transforms and attention-based fusion to enhance multi-array speech recognition robustness and efficiency.
Findings
Achieves 39.26% average character error rate (CER) across heterogeneous microphone arrays (e.g., circular, square, and binaural).
Reduces computations by 97.1% compared to traditional neural beamformers.
Demonstrates strong performance on multiple datasets including Aishell-4, Alimeeting, and XMOS.
Abstract
This paper presents SHTNet, a lightweight spherical harmonic transform (SHT)-based framework designed to address cross-array generalization challenges in multi-channel automatic speech recognition (ASR) through three key innovations. First, SHT-based spatial sound field decomposition converts microphone signals into geometry-invariant spherical harmonic coefficients, isolating signal processing from array geometry. Second, the Spatio-Spectral Attention Fusion Network (SSAFN) combines coordinate-aware spatial modeling, a refined self-attention channel combinator, and spectral noise suppression without conventional beamforming. Third, Rand-SHT training enhances robustness through random channel selection and array geometry reconstruction. The system achieves 39.26% average CER across heterogeneous arrays (e.g., circular, square, and binaural) on datasets including Aishell-4,…
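To make the "geometry-invariant coefficients" idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a first-order real spherical harmonic basis, treats each microphone's position relative to the array centre as a direction on the unit sphere, and uses a least-squares (pseudo-inverse) projection; the abstract does not specify the order, the radial terms, or the solver. The function names `real_sh_basis`, `sht_encode`, and `random_channel_subset` are hypothetical, and the last one is only loosely modelled on the random channel selection mentioned for Rand-SHT.

```python
import numpy as np


def real_sh_basis(directions):
    """Real spherical harmonics up to order 1, evaluated at M directions.

    directions: (M, 3) non-zero Cartesian vectors (normalised internally).
    Returns an (M, 4) matrix with columns [Y_0^0, Y_1^-1, Y_1^0, Y_1^1].
    """
    d = np.asarray(directions, dtype=float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    x, y, z = d[:, 0], d[:, 1], d[:, 2]
    c0 = 0.5 / np.sqrt(np.pi)            # order-0 normalisation
    c1 = np.sqrt(3.0 / (4.0 * np.pi))    # order-1 normalisation
    return np.stack([c0 * np.ones_like(x), c1 * y, c1 * z, c1 * x], axis=1)


def sht_encode(signals, directions):
    """Project (M, T) microphone signals onto 4 SH coefficients per frame.

    Assumption: a plain least-squares fit via the pseudo-inverse of the
    basis matrix. The output shape (4, T) does not depend on M.
    """
    basis = real_sh_basis(directions)          # (M, 4)
    return np.linalg.pinv(basis) @ signals     # (4, T)


def random_channel_subset(signals, directions, k, rng):
    """Keep k randomly chosen channels together with their directions.

    Hypothetical helper illustrating random channel selection only.
    """
    idx = np.sort(rng.choice(signals.shape[0], size=k, replace=False))
    return signals[idx], directions[idx]


rng = np.random.default_rng(0)

# An 8-microphone circular array in the horizontal plane (illustrative geometry).
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
circular = np.stack([np.cos(angles), np.sin(angles), np.zeros(8)], axis=1)
signals = rng.standard_normal((8, 16))         # 8 channels, 16 frames

full = sht_encode(signals, circular)
sub_signals, sub_dirs = random_channel_subset(signals, circular, 5, rng)
reduced = sht_encode(sub_signals, sub_dirs)
print(full.shape, reduced.shape)
```

Whatever the number of input channels, the encoder yields the same number of coefficient channels, which is what lets a downstream network stay agnostic to the array layout. A planar array such as the circular one here cannot observe the vertical (z) component, so that coefficient comes out as zero under the pseudo-inverse.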
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
