MANNER: Multi-view Attention Network for Noise Erasure
Hyun Joon Park, Byung Ha Kang, Wooseok Shin, Jin Sob Kim, Sung Won Han

TL;DR
MANNER is a multi-view attention network that enhances noisy speech in the time domain. It efficiently extracts three different representations from the noisy signal and achieves state-of-the-art results on the VoiceBank-DEMAND dataset.
Contribution
The paper introduces a multi-view attention block inside a convolutional encoder-decoder for time-domain speech enhancement, addressing the limited representations and poor memory efficiency of earlier dual-path models.
Findings
Achieves state-of-the-art performance on the VoiceBank-DEMAND dataset.
Processes noisy speech efficiently while estimating high-quality clean speech.
Outperforms existing methods in objective speech quality metrics.
Abstract
In the field of speech enhancement, time domain methods have difficulties in achieving both high performance and efficiency. Recently, dual-path models have been adopted to represent long sequential features, but they still have limited representations and poor memory efficiency. In this study, we propose Multi-view Attention Network for Noise ERasure (MANNER) consisting of a convolutional encoder-decoder with a multi-view attention block, applied to the time-domain signals. MANNER efficiently extracts three different representations from noisy speech and estimates high-quality clean speech. We evaluated MANNER on the VoiceBank-DEMAND dataset in terms of five objective speech quality metrics. Experimental results show that MANNER achieves state-of-the-art performance while efficiently processing noisy speech.
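To make the idea of a multi-view block concrete, here is a minimal, dependency-free sketch. It is not the authors' implementation. The abstract does not name the three representations, so the sketch assumes a channel view, a global (long-range) view, and a local (short-range) view. The function names, the moving-average stand-in for a local convolution, and the averaged sigmoid mask used for fusion are all hypothetical choices made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_view(feats):
    """Assumed view 1: gate each channel by a sigmoid of its mean over time."""
    return [[v * sigmoid(sum(ch) / len(ch)) for v in ch] for ch in feats]

def global_view(feats):
    """Assumed view 2: dot-product self-attention across all time steps."""
    n_ch, n_t = len(feats), len(feats[0])
    # Re-arrange (channels, time) into one feature vector per time step.
    frames = [[feats[c][t] for c in range(n_ch)] for t in range(n_t)]
    out = [[0.0] * n_t for _ in range(n_ch)]
    scale = math.sqrt(n_ch)
    for t, query in enumerate(frames):
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in frames]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for c in range(n_ch):
            out[c][t] = sum(e / total * frames[s][c] for s, e in enumerate(exps))
    return out

def local_view(feats, width=3):
    """Assumed view 3: moving average over a short window (stand-in for a local convolution)."""
    half = width // 2
    out = []
    for ch in feats:
        row = []
        for t in range(len(ch)):
            window = ch[max(0, t - half): t + half + 1]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

def multi_view_block(feats):
    """Hypothetical fusion: average the three views, squash to (0, 1), use as a mask."""
    views = [channel_view(feats), global_view(feats), local_view(feats)]
    n_ch, n_t = len(feats), len(feats[0])
    return [
        [feats[c][t] * sigmoid(sum(v[c][t] for v in views) / 3.0) for t in range(n_t)]
        for c in range(n_ch)
    ]

# Toy encoder output: 2 channels, 4 time steps.
feats = [[0.1, -0.4, 0.8, 0.2], [0.5, 0.3, -0.2, 0.0]]
masked = multi_view_block(feats)
```

Because the mask lies strictly between 0 and 1, each output value keeps the sign of its input and never exceeds it in magnitude. In a full model, a block like this would sit between the convolutional encoder and decoder layers and its parameters would be learned rather than fixed.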
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
