Deep Active Speech Cancellation with Mamba-Masking Network
Yehuda Mishaly, Lior Wolf, Eliya Nachmani

TL;DR
This paper introduces Mamba-Masking, a deep learning network for Active Speech Cancellation (ASC) that cancels both noise and speech signals, improving on existing methods by up to 7.2 dB in ANC scenarios and 6.2 dB in ASC scenarios.
Contribution
The paper presents a novel Mamba-Masking architecture with adaptive masking and multi-band segmentation for enhanced speech and noise cancellation.
Findings
Up to 7.2 dB improvement in ANC scenarios
Up to 6.2 dB improvement in ASC scenarios
Significantly outperforms existing methods
Abstract
We present a novel deep learning network for Active Speech Cancellation (ASC), advancing beyond Active Noise Cancellation (ANC) methods by effectively canceling both noise and speech signals. The proposed Mamba-Masking architecture introduces a masking mechanism that directly interacts with the encoded reference signal, enabling adaptive and precisely aligned anti-signal generation, even under rapidly changing, high-frequency conditions, as commonly found in speech. Complementing this, a multi-band segmentation strategy further improves phase alignment across frequency bands. Additionally, we introduce an optimization-driven loss function that provides near-optimal supervisory signals for anti-signal generation. Experimental results demonstrate substantial performance gains, achieving up to 7.2 dB improvement in ANC scenarios and 6.2 dB in ASC, significantly outperforming existing methods.
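The masking idea in the abstract (a mask applied directly to the encoded reference signal, then decoded into an anti-signal) can be illustrated with a toy NumPy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the encoder is a fixed orthonormal matrix rather than a learned layer, the constant mask stands in for the Mamba-based masker, the sign flip is placed in the decoder, the acoustic primary and secondary paths are omitted so that the disturbance equals the reference, and the function names (`masked_anti_signal`, `nmse_db`) are hypothetical. The residual is scored as a normalized mean squared error in dB, a common way to report cancellation.

```python
import numpy as np

def frame(x, win, hop):
    """Split a 1-D signal into frames of length `win` taken every `hop` samples."""
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

def overlap_add(frames, hop):
    """Rebuild a 1-D signal by summing frames at their original offsets."""
    n, win = frames.shape
    out = np.zeros((n - 1) * hop + win)
    for i in range(n):
        out[i * hop : i * hop + win] += frames[i]
    return out

def masked_anti_signal(x, enc, dec, mask_fn, win, hop):
    """Encode the reference, multiply by a mask elementwise, decode to an anti-signal."""
    z = frame(x, win, hop) @ enc      # encoded reference, shape (frames, features)
    m = mask_fn(z)                    # mask with the same shape as z
    return overlap_add((m * z) @ dec, hop)

def nmse_db(d, y):
    """Residual power relative to disturbance power, in dB (lower is better)."""
    e = d + y                         # residual after adding the anti-signal
    return 10.0 * np.log10(np.sum(e**2) / np.sum(d**2))

rng = np.random.default_rng(0)
win = hop = 16                        # non-overlapping frames keep the toy exact
x = rng.standard_normal(64)           # stand-in reference signal

# Hypothetical encoder: a random orthonormal basis; the decoder inverts it with a sign flip.
enc, _ = np.linalg.qr(rng.standard_normal((win, win)))
dec = -enc.T

# A constant mask of 0.5 stands in for the learned masker (assumption).
y = masked_anti_signal(x, enc, dec, lambda z: np.full_like(z, 0.5), win, hop)
print(round(nmse_db(x, y), 2))        # -6.02
```

With a mask of 0.5 the anti-signal cancels half of the amplitude, so the residual power is one quarter of the original, about -6.02 dB. A mask closer to 1 would cancel more in this idealized setting; in a real system the mask would be predicted per frame and per feature, and the anti-signal would pass through a secondary acoustic path before reaching the error microphone.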
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Achieves performance improvements over baseline models such as ARN. 2. Comprehensive ablation studies: Tables 9 and 10 systematically evaluate settings such as Mamba vs. LSTM vs. Transformer, single-band vs. multi-band, the masking mechanism, the dual-path structure, and NOAS optimization. 3. The authors show that their method is computationally more efficient while also achieving better performance. 4. The real-world simulation helps justify the effectiveness of the model.
One claim of the paper is questionable: "This is the first work to actively cancel both noise and speech using deep learning." However, their own Table 7 shows DeepANC achieves ~8.56 dB on speech and ARN achieves ~10.31 dB. These are significant cancellation results. Previous methods can cancel speech; they just don't do it as well.
1) Integrates state-space Mamba layers with a multi-band masking architecture in an encoder–masker–decoder pipeline to target phase alignment and high-frequency dynamics in speech, which is a coherent architectural direction for active cancellation tasks. 2) Proposes a two-stage training with a “near-optimal anti-signal” target that projects supervision through the secondary path, a conceptually consistent way to mitigate misleading gradients from acoustic-path mismatch. 3) Presents broad empi
1) Novelty and positioning are unclear: the work is framed as the first to actively cancel both noise and speech with deep learning, but the paper neither cleanly delineates ASC from stronger ANC/beamforming/speech enhancement/wavefield control, nor demonstrates clear conceptual advances beyond “a stronger ANC system that also targets speech.” 2) Real-time causality inconsistency: the theoretical latency bound is stringent, yet reported runtimes exceed that bound and rely on future-frame predict
1. Acoustic scenario: While the existing literature concentrates on active noise cancellation, this paper also considers the speech cancellation scenario, a case seldom tackled in previous DNN-based works. 2. Architecture: This work is the first to adopt Mamba as the modeling block, and a subband-based design is proposed, which achieves better performance than previous works. 3. This paper analyzes the limitation of the existing ANC loss, and proposes a near-optimal anti-signal loss,
1. While this paper generalizes the original ANC scenario into ASC, I think its novelty lies in the acoustic setting rather than in machine learning problem solving. Thus, I think it may be more suitable for acoustic/speech journals, e.g., JASA or TASLP. 2. While ANC has direct applications, I am not quite clear about the concrete applications of ASC, and I am a little curious whether this scenario setting is practical. 3. The motivation for using the Mamba structure seems unclear. 4. While
The experimental design is sound and adequately validates the main methodology. The application of the state space model (Mamba) in the context of active audio or noise cancellation appears to be a novel contribution. The presentation is good: the structure is well organized, data are clearly presented through tables and figures, and technical details are sufficiently elaborated.
The work lacks significant originality: - The Mamba-Masking mechanism adapts existing Mamba models rather than introducing a novel design. - Multi-band processing is a commonly used technique in audio-related tasks. - And the NOAS loss function merely refines existing formulations without addressing previously unsolved problems. The proposed optimization strategy first derives an intermediate target y' based on the reference signal x, and then obtains the final output y based on y'. Essentially,
Taxonomy
Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Wireless Communication Networks Research
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
