MMW: Side Talk Rejection Multi-Microphone Whisper on Smart Glasses

Yang Liu; Li Wan; Yiteng Huang; Yong Xu; Yangyang Shi; Saurabh Adya; Ming Sun; Florian Metze

arXiv:2507.05609 · eess.AS · July 9, 2025

Open Access

TL;DR

This paper introduces MMW, a multi-microphone Whisper framework for smart glasses that suppresses interference from side speech. It improves speech recognition accuracy in noisy environments by combining waveform-level multi-channel fusion, frame-level diarization, and GRPO-based optimization.

Contribution

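The paper presents a multi-microphone Whisper (MMW) framework for side-talk rejection built on three components: a Tri-Mamba Mix Block for waveform-level fusion, a Frame Diarization Mamba Layer for frame-level suppression, and a Multi-Scale Group Relative Policy Optimization (GRPO) strategy.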

Findings

01. Reduces word error rate by 4.95% in noisy conditions.

02. Demonstrates effective multi-channel audio fusion at waveform level.

03. Enhances side-talk suppression through joint optimization.
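
For context on the first finding, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The sketch below is a generic illustration, not the paper's evaluation code; the function name and the example sentences are hypothetical, and the summary above does not say whether the 4.95% figure is an absolute or a relative reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: an extra word picked up from a bystander counts as
# an insertion. One substitution plus one insertion over 4 reference words.
print(word_error_rate("turn on the lights", "turn off the lights please"))  # 0.5
```

Because insertions count as errors, words transcribed from a nearby talker raise WER even when the wearer's own speech is recognized perfectly.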

Abstract

Smart glasses are increasingly positioned as the next-generation interface for ubiquitous access to large language models (LLMs). Nevertheless, achieving reliable interaction in real-world noisy environments remains a major challenge, particularly due to interference from side speech. In this work, we introduce a novel side-talk rejection multi-microphone Whisper (MMW) framework for smart glasses, incorporating three key innovations. First, we propose a Mix Block based on a Tri-Mamba architecture to effectively fuse multi-channel audio at the raw waveform level, while maintaining compatibility with streaming processing. Second, we design a Frame Diarization Mamba Layer to enhance frame-level side-talk suppression, facilitating more efficient fine-tuning of Whisper models. Third, we employ a Multi-Scale Group Relative Policy Optimization (GRPO) strategy to jointly optimize frame-level…
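
The abstract is truncated before it describes the Multi-Scale GRPO objective, so the following is only a minimal sketch of the group-relative advantage used in standard GRPO, not the paper's multi-scale variant. It assumes a group of candidate outputs sampled for the same input, each scored by a scalar reward; the function name, the use of the population standard deviation, and the example rewards are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    Standard GRPO replaces a learned value baseline with this per-group
    normalization. Whether to use the population or the sample standard
    deviation is an implementation choice; this sketch uses the population one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled transcripts of one utterance,
# for example 1.0 if side talk was fully rejected and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)  # above-average samples get positive advantages
```

Samples that score above their group's average receive positive advantages and are reinforced, while below-average samples are penalized, so no separate critic model is needed.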

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation