Input-Adaptive Spectral Feature Compression by Sequence Modeling for Source Separation
Kohei Saijo, Yoshiaki Bando

TL;DR
This paper introduces Spectral Feature Compression (SFC), an input-adaptive, parameter-efficient method for compressing frequency information in source separation that outperforms the conventional band-split module.
Contribution
The paper proposes SFC, a new sequence modeling approach that overcomes limitations of the band-split module by being input-adaptive and reducing parameters, with variants based on cross-attention and Mamba.
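To make "input-adaptive" concrete, here is a minimal NumPy sketch of generic cross-attention pooling along the frequency axis. It is not the authors' SFC architecture, whose details are not given in this summary. All names (`cross_attention_compress`, `queries`, `w_k`, `w_v`) and sizes are hypothetical, and the random weights stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_compress(feats, queries, w_k, w_v):
    """Compress F per-bin feature vectors into K latent vectors.

    feats:   (F, D) features of one time frame, one row per frequency bin
    queries: (K, D) latent queries (hypothetical stand-in for learned ones)
    w_k/w_v: (D, D) key and value projections
    """
    keys = feats @ w_k                                    # (F, D)
    values = feats @ w_v                                  # (F, D)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])   # (K, F)
    attn = softmax(scores, axis=-1)                       # each row sums to 1
    return attn @ values, attn                            # (K, D), (K, F)

rng = np.random.default_rng(0)
n_bins, dim, n_latents = 12, 8, 3          # assumed toy sizes
queries = rng.standard_normal((n_latents, dim))
w_k = rng.standard_normal((dim, dim))
w_v = rng.standard_normal((dim, dim))

feats_a = rng.standard_normal((n_bins, dim))
feats_b = rng.standard_normal((n_bins, dim))
compressed_a, attn_a = cross_attention_compress(feats_a, queries, w_k, w_v)
compressed_b, attn_b = cross_attention_compress(feats_b, queries, w_k, w_v)
```

Because the attention weights are computed from the keys, which depend on the input features, the grouping of frequency bins changes from input to input. The parameter count here depends on `n_latents` and `dim` but not on the number of frequency bins.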
Findings
SFC outperforms the band-split module on music source separation (MSS) and cinematic audio source separation (CASS) tasks.
SFC adaptively captures frequency patterns from input data.
SFC maintains performance across different separator sizes and compression ratios.
Abstract
Time-frequency domain dual-path models have demonstrated strong performance and are widely used in source separation. Because their computational cost grows with the number of frequency bins, these models often use the band-split (BS) module in high-sampling-rate tasks such as music source separation (MSS) and cinematic audio source separation (CASS). The BS encoder compresses frequency information by encoding features for each predefined subband. It achieves effective compression by introducing an inductive bias that places greater emphasis on low-frequency parts. Despite its success, the BS module has two inherent limitations: (i) it is not input-adaptive, preventing the use of input-dependent information, and (ii) the parameter count is large, since each subband requires a dedicated module. To address these issues, we propose Spectral Feature Compression (SFC). SFC compresses the…
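The band-split encoder described in the abstract can be illustrated with a small NumPy sketch. This is a simplified assumption, not the published BS module: each subband gets only a dedicated linear projection, the band widths are made up to show finer resolution at low frequencies, and the input is a real-valued feature matrix.

```python
import numpy as np

def make_band_edges(n_freq, widths):
    """Return (start, end) bin indices for consecutive subbands."""
    edges, start = [], 0
    for w in widths:
        edges.append((start, start + w))
        start += w
    assert start == n_freq, "widths must cover all frequency bins exactly"
    return edges

def band_split_encode(spec, edges, weights):
    """Project each predefined subband with its own matrix.

    spec:    (n_freq, n_frames) real-valued features
    weights: one (dim, band_width) matrix per subband
    returns: (n_bands, dim, n_frames)
    """
    return np.stack([w @ spec[s:e] for (s, e), w in zip(edges, weights)])

rng = np.random.default_rng(0)
n_freq, n_frames, dim = 16, 5, 4     # assumed toy sizes
widths = [1, 1, 2, 4, 8]             # hypothetical: narrow low bands, wide high bands
edges = make_band_edges(n_freq, widths)
weights = [rng.standard_normal((dim, e - s)) for s, e in edges]

spec = rng.standard_normal((n_freq, n_frames))
encoded = band_split_encode(spec, edges, weights)   # 16 bins -> 5 band features
n_params = sum(w.size for w in weights)             # one dedicated matrix per band
```

The band edges and projection matrices are fixed before any input is seen, which is the sense in which this kind of compression is not input-adaptive. Every subband also carries its own weights, so the parameter count in this sketch scales with the number of frequency bins.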
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
