TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation
Mohan Xu, Kai Li, Guo Chen, Xiaolin Hu

TL;DR
The paper introduces TIGER, an efficient speech separation model that reduces computational costs significantly while outperforming state-of-the-art models, and presents EchoSet, a new dataset for realistic acoustic environment evaluation.
Contribution
TIGER is a novel low-parameter, low-complexity speech separation model that uses frequency band division and attention modules, and it is validated on EchoSet, a new and more realistic dataset.
Findings
TIGER reduces parameters by 94.3% and MACs by 95.3%.
TIGER outperforms SOTA models on EchoSet and real-world data.
EchoSet provides a more realistic benchmark for speech separation.
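To put the reported percentages in perspective, a relative reduction can be converted into an "N times smaller" factor. The sketch below only does that arithmetic on the two figures quoted above; the helper name `compression_factor` is an illustrative assumption and is not taken from the paper.

```python
def compression_factor(reduction_pct: float) -> float:
    """Convert a percentage reduction into an 'N times smaller' factor.

    A reduction of r percent leaves (1 - r/100) of the original quantity,
    so the original is 1 / (1 - r/100) times larger than the reduced one.
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Figures quoted in the Findings list above.
param_factor = compression_factor(94.3)
mac_factor = compression_factor(95.3)

print(f"parameters: about {param_factor:.1f}x fewer")
print(f"MACs: about {mac_factor:.1f}x fewer")
```

Under this reading, the reported reductions correspond to roughly 17.5 times fewer parameters and roughly 21.3 times fewer MACs than the model TIGER is compared against.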
Abstract
In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, high efficiency is equally important. Therefore, we propose a speech separation model with significantly reduced parameters and computational costs: Time-frequency Interleaved Gain Extraction and Reconstruction network (TIGER). TIGER leverages prior knowledge to divide frequency bands and compresses frequency information. We employ a multi-scale selective attention module to extract contextual features while introducing a full-frequency-frame attention module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a dataset called EchoSet. This dataset includes noise and more realistic…
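The band-division idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the band widths are hypothetical (narrow at low frequencies, wide at high frequencies), and averaging over frequency is a stand-in for whatever learned compression TIGER actually applies to each band.

```python
import numpy as np


def split_into_bands(spec, band_widths):
    """Split a (freq_bins, frames) spectrogram into consecutive sub-bands."""
    if sum(band_widths) != spec.shape[0]:
        raise ValueError("band widths must sum to the number of frequency bins")
    # Interior split points: cumulative widths without the final edge.
    edges = np.cumsum(band_widths)[:-1]
    return np.split(spec, edges, axis=0)


def compress_bands(bands):
    """Reduce every band to one value per frame.

    Assumption: a plain mean over frequency replaces the learned
    per-band projection that a real model would use.
    """
    return np.stack([band.mean(axis=0) for band in bands], axis=0)


# Hypothetical layout for 65 frequency bins: 8 bands of width 2,
# 4 of width 4, 2 of width 8, and one final band of width 17.
band_widths = [2] * 8 + [4] * 4 + [8] * 2 + [17]

rng = np.random.default_rng(0)
spec = rng.standard_normal((65, 10))  # toy magnitude spectrogram

bands = split_into_bands(spec, band_widths)
compressed = compress_bands(bands)
print(len(bands), compressed.shape)
```

The point of the sketch is the change in size along the frequency axis: 65 bins become 15 band-level features per frame, so any later module that operates across frequency works on a much shorter sequence.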
Peer Reviews
Decision: ICLR 2025 Poster
Overall the paper is well-written and easy to follow. The proposed model architecture looks reasonably nice with the introduction of a new band-split strategy and a new FFI block. Experimental results show that the model not only achieves competitive performance on all datasets but is also very lightweight in terms of model size and MACs. Its efficiency in training (GPU time & memory) and inference (CPU time, GPU time & memory) looks good, too. The motivation of creating EchoSet dataset is cl
The FFI block follows the common dual-path design and consists of two parts: a frequency path and a frame path. Each path has two main modules: multi-scale selective attention (MSA) and full-frequency-frame attention (F^3A). While F^3A looks similar to the self-attention mechanism, MSA extracts features through a selective attention mechanism at multiple scales. We may need an ablation study of the MSA architecture with different scales (e.g. 1,2,3,4) to see how it affects model
The paper is well written and the model is described in a detailed manner, even though there may be room for improvement in the model description for more clarity. The method applies to data at many different sampling rates thanks to the band-split RNN based encoder/decoder. The model is also used to perform cinematic audio separation at 44.1 kHz. The ablations and various comparisons with state-of-the-art on multiple relevant datasets are all appropriate and impressive.
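The dual-path structure this review describes (a frequency path followed by a frame path) can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's FFI block: plain scaled dot-product self-attention without learned weights stands in for both MSA and F^3A, and the residual connections are an assumption.

```python
import numpy as np


def self_attention(x):
    """Scaled dot-product self-attention over the rows of x, shape (seq, dim).

    Assumption: queries, keys and values are all x itself (no learned
    projections), which keeps the sketch short.
    """
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def dual_path_block(feats):
    """Apply attention along frequency, then along time.

    feats has shape (bands, frames, channels).
    """
    n_bands, n_frames, _ = feats.shape
    # Frequency path: for each frame, attend across the band axis.
    freq_out = np.stack(
        [self_attention(feats[:, t, :]) for t in range(n_frames)], axis=1
    )
    feats = feats + freq_out  # residual connection (assumed)
    # Frame path: for each band, attend across the time axis.
    time_out = np.stack(
        [self_attention(feats[b, :, :]) for b in range(n_bands)], axis=0
    )
    return feats + time_out  # residual connection (assumed)


rng = np.random.default_rng(0)
feats = rng.standard_normal((15, 10, 4))  # toy (bands, frames, channels)
out = dual_path_block(feats)
print(out.shape)
```

The two paths differ only in which axis is treated as the sequence, so alternating them lets every band-frame position eventually draw on context from both frequency and time.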
The loss function was not mentioned in the main text (or I missed it). Is the loss in (10) in the appendix used for all tasks, or only for cinematic sound separation? The math in the MSA module description gets a bit hard to follow, so maybe a more detailed figure that tracks along with the mathematical equations would help. How does selective attention (SA) work? It was not described in the paper.
1) The authors' provision of open-source code enables researchers in the field to reproduce and build upon this work. 2) The authors provide a detailed ablation study that clarifies the impact and effectiveness of each proposed module.
1) The proposed approach does not demonstrate clear performance advantages over the current SOTA method, with a noticeable performance gap. 2) Although the authors claimed their approach is more lightweight, the comparison is not entirely fair. A comparison with other systems that specifically employ lightweight methods would provide a more accurate assessment of the model's efficiency.
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
Methods: Softmax · Attention Is All You Need
