TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Kohei Saijo; Gordon Wichern; François G. Germain; Zexu Pan; Jonathan Le Roux

arXiv:2408.03440 · eess.AS · August 8, 2024

TL;DR

TF-Locoformer is a Transformer-based speech separation and enhancement model that replaces RNNs with convolutional feed-forward networks to improve local modeling while maintaining state-of-the-art performance.

Contribution

This work introduces TF-Locoformer, a novel RNN-free Transformer architecture with convolutional local modeling for speech tasks, achieving state-of-the-art results.

Findings

01. Outperforms existing models on multiple benchmarks.
02. Maintains high fidelity in speech separation and enhancement.
03. Achieves comparable or better results than RNN-based models.

Abstract

Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path…
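The block layout described in the abstract (a convolutional FFN, then self-attention, then a second convolutional FFN) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the ReLU activation, the single-head attention with identity projections, the residual connections, and all sizes are assumptions chosen to keep the example short and dependency-free, and normalization is omitted entirely. In a dual-path model a block like this would be applied along one axis (frequency or time) at a time, which is not shown here.

```python
import math
import random


def conv_ffn(x, w_in, w_out):
    """FFN whose first layer is a 1-D convolution over the sequence axis.

    x:     T x C input (list of T frames, each a list of C floats)
    w_in:  K x C x H convolution kernel (K odd), zero-padded to keep length T
    w_out: H x C pointwise projection back to C channels
    """
    T, C = len(x), len(x[0])
    K, H = len(w_in), len(w_out)
    pad = K // 2
    out = []
    for t in range(T):
        hidden = [0.0] * H
        for k in range(K):
            s = t + k - pad
            if 0 <= s < T:  # positions outside the sequence act as zeros
                for c in range(C):
                    for h in range(H):
                        hidden[h] += x[s][c] * w_in[k][c][h]
        hidden = [max(0.0, v) for v in hidden]  # ReLU: an illustrative choice
        out.append([sum(hidden[h] * w_out[h][c] for h in range(H))
                    for c in range(C)])
    return out


def self_attention(x):
    """Single-head scaled dot-product attention with identity Q/K/V (assumed)."""
    C = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(C) for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[c] for w, v in zip(weights, x))
                    for c in range(C)])
    return out


def add(x, y):
    """Element-wise sum of two T x C arrays (used for residual connections)."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def block(x, ffn1, ffn2):
    """Conv-FFN -> self-attention -> conv-FFN, each with a residual connection."""
    x = add(x, conv_ffn(x, *ffn1))   # local modeling before attention
    x = add(x, self_attention(x))    # global modeling
    x = add(x, conv_ffn(x, *ffn2))   # local modeling after attention
    return x


def rand_weights(K, C, H, rng):
    """Hypothetical helper: small random weights for one conv-FFN."""
    w_in = [[[rng.gauss(0, 0.1) for _ in range(H)] for _ in range(C)]
            for _ in range(K)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(C)] for _ in range(H)]
    return w_in, w_out


rng = random.Random(0)
T, C, H, K = 6, 4, 8, 3  # illustrative sizes only
x = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(T)]
y = block(x, rand_weights(K, C, H, rng), rand_weights(K, C, H, rng))
print(len(y), len(y[0]))  # the block preserves the T x C shape: 6 4
```

With a kernel of size K, each output frame of `conv_ffn` depends only on the K nearest input frames, whereas every output frame of `self_attention` depends on the whole sequence. That split is the division of labor the abstract describes.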

Code & Models

Repositories

merlresearch/tf-locoformer (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Convolution · Softmax · Absolute Position Encodings