Real-Time Speech Enhancement via a Hybrid ViT: A Dual-Input Acoustic-Image Feature Fusion
Behnaz Bahmei, Siamak Arzanpour, Elina Birmingham

TL;DR
This paper introduces a dual-input framework built on a hybrid Vision Transformer (ViT) for real-time speech enhancement. It models both temporal and spectral dependencies in noisy signals, improves suppression of non-stationary noise, and is lightweight enough to run on embedded devices.
Contribution
It proposes a novel hybrid ViT model with dual-input acoustic-image feature fusion for real-time, non-stationary noise suppression in speech signals, suitable for embedded systems.
Findings
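Significant improvements in four standard quality metrics: PESQ (perceptual evaluation of speech quality), STOI (short-time objective intelligibility), Seg SNR (segmental signal-to-noise ratio), and LLR (log-likelihood ratio).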
Enhanced speech intelligibility and perceptual quality in noisy environments.
Performance close to clean speech references.
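
Of the metrics listed above, segmental SNR is simple enough to show directly. The sketch below follows one common textbook convention, not necessarily the evaluation setup used in the paper: per-frame SNR in dB, clamped to [-10, 35] dB, then averaged over frames. The function name `segmental_snr`, the frame length of 256 samples, and the clamp limits are assumptions made for illustration.

```python
import math


def segmental_snr(clean, enhanced, frame_len=256, lo_db=-10.0, hi_db=35.0):
    """Mean per-frame SNR in dB, with each frame clamped to [lo_db, hi_db].

    `clean` and `enhanced` are equal-length sequences of float samples.
    frame_len, lo_db and hi_db are illustrative defaults (assumptions),
    not values taken from the paper.
    """
    if len(clean) != len(enhanced):
        raise ValueError("signals must have the same length")
    eps = 1e-10  # avoids division by zero and log10(0)
    frame_snrs = []
    # Non-overlapping frames; a trailing partial frame is dropped.
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal_energy = sum(x * x for x in s)
        noise_energy = sum((x - y) ** 2 for x, y in zip(s, e))
        snr_db = 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))
        frame_snrs.append(min(max(snr_db, lo_db), hi_db))
    if not frame_snrs:
        raise ValueError("signal is shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)
```

Under this convention, an enhanced signal identical to the clean reference reaches the 35 dB ceiling in every frame, while an all-zero output scores about 0 dB because the residual energy equals the signal energy.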
Abstract
Speech quality and intelligibility are significantly degraded in noisy environments. This paper presents a novel transformer-based learning framework to address the single-channel noise suppression problem for real-time applications. Although existing deep learning networks have shown remarkable improvements in handling stationary noise, their performance often diminishes in real-world environments characterized by non-stationary noise (e.g., dog barking, baby crying). The proposed dual-input acoustic-image feature fusion using a hybrid ViT framework effectively models both temporal and spectral dependencies in noisy signals. Designed for real-world audio environments, the proposed framework is computationally lightweight and suitable for implementation on embedded devices. To evaluate its effectiveness, four standard and commonly used quality measurements, namely PESQ, STOI, Seg SNR,…
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Advanced Adaptive Filtering Techniques
