Real-Time Speech Enhancement via a Hybrid ViT: A Dual-Input Acoustic-Image Feature Fusion

Behnaz Bahmei; Siamak Arzanpour; Elina Birmingham

arXiv:2511.11825·cs.SD·November 18, 2025

Open Access

TL;DR

This paper introduces a hybrid ViT-based dual-input framework for real-time speech enhancement that models both temporal and spectral features, improving suppression of non-stationary noise while remaining lightweight enough for embedded devices.

Contribution

It proposes a novel hybrid ViT model with dual-input acoustic-image feature fusion for real-time, non-stationary noise suppression in speech signals, suitable for embedded systems.

Findings

01. Significant improvements in PESQ, STOI, Seg SNR, and LLR metrics.

02. Enhanced speech intelligibility and perceptual quality in noisy environments.

03. Performance close to clean speech references.
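
Of the metrics named in the findings above, segmental SNR (Seg SNR) is the simplest to compute by hand. It is commonly defined as the mean of per-frame SNR values in dB, with each frame clamped to a fixed range so that silent or near-perfect frames do not dominate the average. The sketch below follows that common definition. The frame length, hop size, and the [-10, 35] dB clamp are conventional defaults assumed here for illustration, not settings taken from the paper, and `segmental_snr` is a hypothetical helper name.

```python
import math

def segmental_snr(clean, enhanced, frame_len=256, hop=128,
                  lo_db=-10.0, hi_db=35.0, eps=1e-10):
    """Mean per-frame SNR in dB, each frame clamped to [lo_db, hi_db].

    All default parameter values are assumptions for illustration.
    """
    if len(clean) != len(enhanced):
        raise ValueError("signals must have the same length")
    frame_snrs = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        # Energy of the reference frame and of the residual error.
        signal_energy = sum(x * x for x in c)
        noise_energy = sum((x - y) ** 2 for x, y in zip(c, e))
        snr_db = 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))
        # Clamp so that outlier frames do not dominate the mean.
        frame_snrs.append(min(max(snr_db, lo_db), hi_db))
    if not frame_snrs:
        raise ValueError("signal shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)

# Usage: a 440 Hz tone at 16 kHz, with the "enhanced" copy attenuated by 10%.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1024)]
enhanced = [0.9 * x for x in clean]
print(round(segmental_snr(clean, enhanced), 2))  # about 20 dB
```

A 10% amplitude error leaves a residual with 1% of the signal energy in every frame, which is why the toy example lands near 20 dB. Higher values indicate that the enhanced signal is closer to the clean reference.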

Abstract

Speech quality and intelligibility are significantly degraded in noisy environments. This paper presents a novel transformer-based learning framework to address the single-channel noise suppression problem for real-time applications. Although existing deep learning networks have shown remarkable improvements in handling stationary noise, their performance often diminishes in real-world environments characterized by non-stationary noise (e.g., dog barking, baby crying). The proposed dual-input acoustic-image feature fusion using a hybrid ViT framework effectively models both temporal and spectral dependencies in noisy signals. Designed for real-world audio environments, the proposed framework is computationally lightweight and suitable for implementation on embedded devices. To evaluate its effectiveness, four standard and commonly used quality measurements, namely PESQ, STOI, Seg SNR,…

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Advanced Adaptive Filtering Techniques