Efficient Vocal Source Separation Through Windowed Sink Attention

Christodoulos Benetatos; Yongyi Zang; Randal Leistikow

arXiv:2510.25745·cs.SD·October 30, 2025

TL;DR

This paper introduces windowed sink attention (WSA), a localized attention mechanism for vocal source separation that reduces FLOPs by 44.5x while recovering 92% of the original SDR performance after fine-tuning.

Contribution

The paper proposes WSA, which replaces full temporal self-attention with a small local attention window plus attention sinks, enabling efficient vocal separation with a limited loss in SDR.

Findings

01. Recovered 92% of original SDR performance after fine-tuning

02. Reduced FLOPs by 44.5 times compared to full attention

03. Achieved high-quality separation with lower computational cost
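
To illustrate why a local window lowers cost, the sketch below counts how many query-key score computations full attention and a windowed scheme with sinks would each need. It assumes a symmetric window of `window` frames on each side of the query and treats the first `num_sinks` frames as sinks. Both choices, and all function and parameter names, are illustrative assumptions rather than the paper's configuration, and these pair counts are not the FLOP accounting behind the reported 44.5x figure.

```python
def attended_pairs_full(num_frames: int) -> int:
    """Full attention: every frame scores against every frame."""
    return num_frames * num_frames


def attended_pairs_windowed(num_frames: int, window: int, num_sinks: int) -> int:
    """Windowed attention with sinks (assumed layout, see lead-in).

    Each query frame t scores against frames within `window` positions of t,
    plus the first `num_sinks` frames of the sequence.
    """
    sinks = set(range(min(num_sinks, num_frames)))
    total = 0
    for t in range(num_frames):
        lo = max(0, t - window)
        hi = min(num_frames - 1, t + window)
        local = set(range(lo, hi + 1))
        # A sink that already lies inside the local window is counted once.
        total += len(local | sinks)
    return total


if __name__ == "__main__":
    for frames in (8, 64, 512):
        full = attended_pairs_full(frames)
        windowed = attended_pairs_windowed(frames, window=1, num_sinks=1)
        print(frames, full, windowed)
```

With 8 frames, a window of 1, and one sink, this layout needs 28 score computations instead of 64. The full count grows quadratically with the number of frames, while the windowed count grows roughly linearly.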

Abstract

State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention mechanisms, where each temporal frame interacts with every other frame. This incurs a heavy computational cost that scales quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA), which uses a small temporal attention window and attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR performance while reducing FLOPs by 44.5x. We release our code and checkpoints under the MIT license at https://github.com/smulelabs/windowed-roformer.
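
The idea of restricting each frame to a small window plus a few sink positions can be sketched in plain Python. This is a minimal single-head illustration under stated assumptions, not the released implementation: the window is assumed symmetric around the query frame, the sinks are assumed to be the first `num_sinks` frames, and `wsa_allowed` and `windowed_sink_attention` are hypothetical names. A practical version would use a tensor library and batched heads.

```python
import math


def wsa_allowed(query: int, key: int, window: int, num_sinks: int) -> bool:
    """Assumed rule: a key is visible if it is near the query or is a sink."""
    return abs(query - key) <= window or key < num_sinks


def windowed_sink_attention(q, k, v, window: int, num_sinks: int):
    """Scaled dot-product attention restricted to window and sink positions.

    q, k, v are lists of per-frame float vectors (q and k share a dimension).
    Scores are computed only for visible keys, so work per frame is bounded
    by the window size plus the number of sinks rather than sequence length.
    """
    dim = len(q[0])
    out = []
    for t, q_t in enumerate(q):
        keys = [j for j in range(len(k)) if wsa_allowed(t, j, window, num_sinks)]
        scores = [
            sum(a * b for a, b in zip(q_t, k[j])) / math.sqrt(dim) for j in keys
        ]
        # Numerically stable softmax over the visible keys only.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        norm = sum(exps)
        weights = [e / norm for e in exps]
        out.append(
            [sum(w * v[j][d] for w, j in zip(weights, keys)) for d in range(len(v[0]))]
        )
    return out


if __name__ == "__main__":
    # Toy input: identical queries and keys give uniform weights over visible frames.
    q = [[0.0] for _ in range(4)]
    k = [[0.0] for _ in range(4)]
    v = [[0.0], [1.0], [2.0], [3.0]]
    print(windowed_sink_attention(q, k, v, window=1, num_sinks=1))
```

Setting `window` to at least the sequence length makes every key visible, so the same function reduces to full attention, which is a convenient way to check the restricted version against a baseline.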
