TridentSE: Guiding Speech Enhancement with 32 Global Tokens
Dacheng Yin, Zhiyuan Zhao, Chuanxin Tang, Zhiwei Xiong, Chong Luo

TL;DR
TridentSE is a speech enhancement architecture that captures both global information and local detail by pairing T-F bin representations with a small set of global tokens linked through cross attention, reaching high perceptual quality at lower computational cost than previous methods.
Contribution
The paper proposes TridentSE, which keeps a local T-F (time-frequency) bin representation alongside a small number of global tokens and exchanges information between the two through cross attention.
Findings
Achieves PESQ of 3.47 on VoiceBank+DEMAND
Achieves PESQ of 3.44 on DNS no-reverb
Outperforms previous methods with lower computational cost
Abstract
In this paper, we present TridentSE, a novel architecture for speech enhancement that efficiently captures both global information and local details. TridentSE maintains a T-F bin level representation to capture details and uses a small number of global tokens to process global information. Information is propagated between the local and global representations through cross attention modules. To capture both inter- and intra-frame information, the global tokens are divided into two groups that process along the time and frequency axes, respectively. A metric discriminator is further employed to guide the model toward higher perceptual quality. Even with significantly lower computational cost, TridentSE outperforms a variety of previous speech enhancement methods, achieving a PESQ of 3.47 on the VoiceBank+DEMAND dataset and a PESQ of 3.44 on DNS no-reverb test…
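The exchange between local T-F features and global tokens described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 16/16 split of the 32 tokens, the feature sizes, the absence of learned projections, and the helper names `cross_attention` and `exchange` are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head scaled dot-product attention, no learned projections.

    queries: (B, Nq, C), context: (B, Nk, C) -> output: (B, Nq, C)
    """
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ np.swapaxes(context, -1, -2) * scale, axis=-1)
    return weights @ context

def exchange(local, tokens):
    """Hypothetical local <-> global exchange along one axis.

    local:  (B, L, C) batch of sequences (L positions each)
    tokens: (n, C) global tokens shared by every sequence
    """
    tok = np.broadcast_to(tokens, (local.shape[0],) + tokens.shape)
    tok = tok + cross_attention(tok, local)      # tokens gather from local bins
    local = local + cross_attention(local, tok)  # local bins read from tokens
    return local, tok

rng = np.random.default_rng(0)
T, F, C = 100, 64, 16        # frames, frequency bins, channels (assumed sizes)
n_time, n_freq = 16, 16      # assumed split of the 32 global tokens

x = rng.standard_normal((T, F, C))
time_tokens = rng.standard_normal((n_time, C))
freq_tokens = rng.standard_normal((n_freq, C))

# Time axis (inter-frame): each frequency bin is a sequence of T frames.
x_t, _ = exchange(np.swapaxes(x, 0, 1), time_tokens)   # (F, T, C)
x = np.swapaxes(x_t, 0, 1)                             # back to (T, F, C)

# Frequency axis (intra-frame): each frame is a sequence of F bins.
x, _ = exchange(x, freq_tokens)                        # (T, F, C)

print(x.shape)
```

In this sketch each attention map has size L × n (sequence length by token count) rather than L × L as in full self-attention over the sequence, which is one way a small token set can keep the cost of global context low.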
Taxonomy
Topics: Speech and Audio Processing · Infant Health and Development · Speech Recognition and Synthesis
