Multi-channel multi-speaker transformer for speech recognition

Guo Yifan; Tian Yao; Suo Hongbin; Wan Yulong

arXiv:2601.02688 · cs.SD · January 7, 2026

TL;DR

This paper introduces M2Former, a multi-channel multi-speaker transformer for far-field multi-speaker speech recognition. On the SMS-WSJ benchmark, it achieves relative word error rate reductions of 9.2% to 52.2% over four existing end-to-end systems.

Contribution

The paper proposes M2Former, a multi-channel transformer architecture designed to encode high-dimensional acoustic features for each speaker from mixed input audio, which the earlier multi-channel transformer (MCT) cannot do because of interference between speakers.

Findings

01

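M2Former achieves a lower word error rate (WER) than end-to-end systems based on the neural beamformer, MCT, dual-path RNN with transform-average-concatenate, and multi-channel deep clustering.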

02

Achieves relative WER reductions of 9.2% to 52.2% on SMS-WSJ, with the largest gain over the multi-channel deep clustering based system.

03

Demonstrates effectiveness in far-field multi-speaker ASR scenarios.

Abstract

With the development of teleconferencing and in-vehicle voice assistants, far-field multi-speaker speech recognition has become a hot research topic. Recently, a multi-channel transformer (MCT) has been proposed, which demonstrates the ability of the transformer to model far-field acoustic environments. However, MCT cannot encode high-dimensional acoustic features for each speaker from mixed input audio because of the interference between speakers. To address this limitation, we propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker ASR. Experiments on the SMS-WSJ benchmark show that the M2Former outperforms the neural beamformer, MCT, dual-path RNN with transform-average-concatenate, and multi-channel deep clustering based end-to-end systems by 9.2%, 14.3%, 24.9%, and 52.2%, respectively, in terms of relative word error rate reduction.
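
The percentages above are relative reductions, not absolute differences in word error rate. As a minimal sketch of how such a figure is conventionally computed, the snippet below uses a hypothetical helper, `relative_wer_reduction`, and made-up WER values. Neither the function name nor the numbers come from the paper, which does not report absolute WERs in this abstract.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of a system over a baseline, in percent.

    Computed as 100 * (baseline - system) / baseline.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Illustrative values only (assumed, not taken from the paper):
# a baseline at 20.0% WER and a new system at 15.0% WER.
reduction = relative_wer_reduction(baseline_wer=20.0, system_wer=15.0)
print(f"{reduction:.1f}% relative WER reduction")  # 25.0% relative WER reduction
```

Under this definition, a 52.2% relative reduction means the new system's WER is a little under half of the baseline's, whatever the absolute values are.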

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders