Study of Lightweight Transformer Architectures for Single-Channel Speech Enhancement

Haixin Zhao; Nilesh Madhu

arXiv:2505.21057·eess.AS·January 30, 2026

Open Access

TL;DR

This paper introduces a lightweight, transformer-based speech enhancement model that achieves state-of-the-art performance with significantly fewer parameters and lower computational cost, making it suitable for edge devices.

Contribution

The paper proposes a streamlined Frequency-Time-Frequency (FTF) stacked-transformer architecture trained adversarially, which reduces complexity while matching or improving on the performance of existing models.

Findings

01

LCT-GAN requires only 6% of DeepFilterNet2's parameters while delivering similar performance.

02

LCT-GAN uses 9% fewer parameters and 10% fewer multiply-accumulate operations (MACs) than CCFNet+(Lite).

03

LCT-GAN outperforms more complex baseline models on standard datasets.

Abstract

In speech enhancement, achieving state-of-the-art (SotA) performance while adhering to the computational constraints on edge devices remains a formidable challenge. Networks integrating stacked temporal and spectral modelling effectively leverage improved architectures such as transformers; however, they inevitably incur substantial computational complexity and model expansion. Through systematic ablation analysis on transformer-based temporal and spectral modelling, we demonstrate that the architecture employing streamlined Frequency-Time-Frequency (FTF) stacked transformers efficiently learns global dependencies within causal context, while avoiding considerable computational demands. Utilising discriminators in training further improves learning efficacy and enhancement without introducing additional complexity during inference. The proposed lightweight, causal, transformer-based…
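
To make the Frequency-Time-Frequency stacking and the "causal context" concrete, here is a minimal NumPy sketch. It is not the authors' LCT-GAN implementation: the function names `self_attention` and `ftf_block`, the `(T, F, C)` tensor layout, the single attention head, and the use of identity projections (no learned weights, no feed-forward layers, no normalisation) are all assumptions made purely for illustration. It only shows the ordering of the attention axes and how a causal mask on the time axis keeps each frame from seeing future frames.

```python
import numpy as np

def self_attention(x, causal=False):
    """Single-head scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., L, C). Queries, keys and values are x itself
    (identity projections), an assumption that keeps the sketch short.
    """
    L, C = x.shape[-2], x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(C)      # (..., L, L)
    if causal:
        # Position i may only attend to positions j <= i.
        future = np.triu(np.ones((L, L), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return x + weights @ x                                 # residual connection

def ftf_block(x):
    """Hypothetical FTF stack on features of shape (T, F, C)."""
    # Frequency: attend across the F bins within each time frame.
    x = self_attention(x)
    # Time: attend across frames for each frequency bin, causally.
    x = np.swapaxes(self_attention(np.swapaxes(x, 0, 1), causal=True), 0, 1)
    # Frequency again.
    return self_attention(x)
```

Because the frequency stages operate within a single frame and the time stage is masked, the output for frame `t` depends only on frames `0..t`. That is the property a streaming, frame-by-frame enhancer needs.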

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques