Efficient Vocal Source Separation Through Windowed Sink Attention
Christodoulos Benetatos, Yongyi Zang, Randal Leistikow

TL;DR
This paper introduces windowed sink attention (WSA), a localized attention mechanism for vocal source separation that cuts FLOPs by 44.5x while recovering 92% of the original SDR performance after fine-tuning.
Contribution
The paper proposes WSA, replacing full attention with localized windows and sinks, enabling efficient vocal separation with minimal performance loss.
Findings
Recovered 92% of original SDR performance after fine-tuning
Reduced FLOPs by 44.5 times compared to full attention
Found that temporal attention patterns in a pre-trained vocal separation model are highly localized
Abstract
State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention mechanisms, where each temporal frame interacts with every other frame. This incurs a heavy computational cost that scales quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA), which combines a small temporal attention window with attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR performance while reducing FLOPs by 44.5x. We release our code and checkpoints under the MIT license at https://github.com/smulelabs/windowed-roformer.
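To make the attention pattern described in the abstract concrete, here is a minimal NumPy sketch of a windowed mask with sink positions. It is an illustration under stated assumptions, not the authors' implementation: the function names `windowed_sink_mask` and `masked_attention`, the symmetric (non-causal) window, and the choice to treat the first `num_sinks` frames as sinks are all hypothetical; the released code may realize sinks differently.

```python
import numpy as np

def windowed_sink_mask(num_frames, window, num_sinks):
    """Boolean (num_frames, num_frames) mask; True means query i may attend to key j.

    Assumptions (not from the paper): the window is symmetric around each
    query frame, and the sinks are simply the first `num_sinks` frames.
    """
    idx = np.arange(num_frames)
    # Local band: keys within `window` frames on either side of the query.
    local = np.abs(idx[:, None] - idx[None, :]) <= window
    # Sink columns: visible to every query regardless of distance.
    sinks = np.broadcast_to(idx[None, :] < num_sinks, (num_frames, num_frames))
    return local | sinks

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; every row has at
    # least its diagonal entry unmasked, so the maximum is finite.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    mask = windowed_sink_mask(num_frames=8, window=1, num_sinks=2)
    print(mask.astype(int))
```

This dense mask only shows which frame pairs interact; it still computes the full score matrix, so it does not save any computation by itself. The FLOP reduction reported in the abstract would require an implementation that skips the masked-out blocks entirely.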
