MMW: Side Talk Rejection Multi-Microphone Whisper on Smart Glasses
Yang Liu, Li Wan, Yiteng Huang, Yong Xu, Yangyang Shi, Saurabh Adya, Ming Sun, Florian Metze

TL;DR
This paper introduces a multi-microphone framework for smart glasses that effectively suppresses side speech interference, improving speech recognition accuracy in noisy environments by integrating novel waveform fusion, diarization, and optimization techniques.
Contribution
It presents a novel multi-microphone Whisper framework with innovative waveform fusion, frame diarization, and a multi-scale optimization strategy for side-talk rejection.
Findings
Reduces word error rate by 4.95% in noisy conditions.
Demonstrates effective multi-channel audio fusion at waveform level.
Enhances side-talk suppression through joint optimization.
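The first finding is stated in terms of word error rate (WER). As a reminder of how that metric is usually computed, here is a minimal sketch using word-level edit distance. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the authors' evaluation code, and this summary does not say whether the 4.95% figure is an absolute or a relative reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, side speech that leaks into the transcript shows up mainly as insertions, which is why side-talk rejection can lower WER even when the wearer's own words are recognized correctly.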
Abstract
Smart glasses are increasingly positioned as the next-generation interface for ubiquitous access to large language models (LLMs). Nevertheless, achieving reliable interaction in real-world noisy environments remains a major challenge, particularly due to interference from side speech. In this work, we introduce a novel side-talk rejection multi-microphone Whisper (MMW) framework for smart glasses, incorporating three key innovations. First, we propose a Mix Block based on a Tri-Mamba architecture to effectively fuse multi-channel audio at the raw waveform level, while maintaining compatibility with streaming processing. Second, we design a Frame Diarization Mamba Layer to enhance frame-level side-talk suppression, facilitating more efficient fine-tuning of Whisper models. Third, we employ a Multi-Scale Group Relative Policy Optimization (GRPO) strategy to jointly optimize frame-level…
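The abstract is cut off before it describes the Multi-Scale GRPO objective, so the details of the multi-scale variant are not available here. The sketch below shows only the generic group-relative step that gives GRPO its name: rewards for a group of candidate outputs sampled from the same input are normalized against that group's own mean and standard deviation. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration, not the paper's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper: `rewards` holds one scalar reward per candidate
    output drawn for the same input. Returns one advantage per candidate.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every candidate has the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Candidates that score above the group average receive a positive advantage and those below receive a negative one, so no separately learned value baseline is needed.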
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation
