Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform

Xiangzhu Kong; Huang Hao; Zhijian Ou

arXiv:2506.11630 · eess.AS · October 22, 2025 · Interspeech

Open Access

TL;DR

This paper introduces SHTNet, a lightweight, geometry-invariant multi-channel speech recognition framework that improves cross-array robustness and reduces computational complexity using spherical harmonic transforms and novel attention mechanisms.

Contribution

The paper proposes SHTNet, a novel framework combining spherical harmonic transforms and attention-based fusion to enhance multi-array speech recognition robustness and efficiency.

Findings

01. Achieves 39.26% average CER across diverse microphone arrays.

02. Reduces computations by 97.1% compared to traditional neural beamformers.

03. Demonstrates strong performance on multiple datasets including Aishell-4, Alimeeting, and XMOS.
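
For readers unfamiliar with the metric, CER (character error rate) is conventionally the character-level edit distance between a reference transcript and the recognizer output, divided by the reference length. The sketch below is a generic illustration of that definition, not the paper's scoring script; the function names and the example strings are hypothetical, and how the paper averages CER across arrays is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference characters so far
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent (assumes a non-empty reference)."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Hypothetical reference and hypothesis differing in one character.
example_cer = cer("今天天气很好", "今天天汽很好")
print(f"{example_cer:.2f}")
```

Lower values are better, and because insertions count as errors the value can exceed 100% on very noisy output.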

Abstract

This paper presents SHTNet, a lightweight framework based on the spherical harmonic transform (SHT) that addresses cross-array generalization challenges in multi-channel automatic speech recognition (ASR) through three key innovations. First, SHT-based spatial sound field decomposition converts microphone signals into geometry-invariant spherical harmonic coefficients, isolating signal processing from array geometry. Second, the Spatio-Spectral Attention Fusion Network (SSAFN) combines coordinate-aware spatial modeling, a refined self-attention channel combinator, and spectral noise suppression without conventional beamforming. Third, Rand-SHT training enhances robustness through random channel selection and array geometry reconstruction. The system achieves 39.26% average CER across heterogeneous arrays (e.g., circular, square, and binaural) on datasets including Aishell-4,…
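
The geometry-invariance idea in the abstract can be illustrated with a minimal sketch. It assumes an idealized setting that is not taken from the paper: each microphone samples an order-1 sound field at a known direction on the unit sphere, and coefficients are recovered by least squares. The function names, the first-order truncation, the array layouts, and the random 5-channel subset (a loose stand-in for the random channel selection in Rand-SHT) are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def real_sh_matrix(directions):
    """Real spherical harmonics up to order 1 at the given directions.

    directions: (M, 3) array of mic direction vectors -> (M, 4) matrix.
    """
    d = np.asarray(directions, dtype=float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)  # project onto unit sphere
    x, y, z = d[:, 0], d[:, 1], d[:, 2]
    c0 = 0.5 / np.sqrt(np.pi)              # order-0 normalization
    c1 = np.sqrt(3.0 / (4.0 * np.pi))      # order-1 normalization
    return np.stack([c0 * np.ones_like(x), c1 * y, c1 * z, c1 * x], axis=1)


def sht_encode(signals, directions):
    """Least-squares SH coefficients: signals (M, T) -> coefficients (4, T)."""
    return np.linalg.pinv(real_sh_matrix(directions)) @ signals


rng = np.random.default_rng(0)
# Hypothetical order-1 sound field: 4 coefficients over 16 frames.
coeffs_true = rng.standard_normal((4, 16))

# Two hypothetical array geometries: 6 mics on the axes, 8 mics on cube corners.
octahedron = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                       [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
cube = np.array([[sx, sy, sz] for sx in (-1, 1)
                 for sy in (-1, 1) for sz in (-1, 1)], dtype=float)


def record(directions):
    """Simulate what each mic would observe for the assumed sound field."""
    return real_sh_matrix(directions) @ coeffs_true


coeffs_oct = sht_encode(record(octahedron), octahedron)
coeffs_cube = sht_encode(record(cube), cube)

# Random channel selection: keep 5 of the 8 cube mics and re-encode.
keep = rng.choice(len(cube), size=5, replace=False)
coeffs_rand = sht_encode(record(cube[keep]), cube[keep])

# Largest disagreement between the two arrays' coefficients.
print(np.abs(coeffs_oct - coeffs_cube).max())
```

In this toy setting all three encodings agree up to floating-point error, so a downstream model would see the same representation regardless of which array recorded the scene. Real arrays add complications the sketch ignores, such as frequency-dependent mode strength, spatial aliasing, and noise.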

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis