Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating

Dipan Maity; Suman Mondal; Arindam Roy

arXiv:2604.06014·cs.LG·April 13, 2026

TL;DR

Gated-SwinRMT is a family of hybrid vision transformers that combines shifted-window attention with Manhattan-distance decay and input-dependent gating, reaching 80.22% (SWAT variant) and 78.20% (Retention variant) accuracy on Mini-ImageNet.

Contribution

The paper unifies Swin Transformer shifted-window attention with Manhattan-distance decay and input-dependent gating, and proposes two variants that outperform the RMT baseline on Mini-ImageNet.

Findings

01

Gated-SwinRMT-SWAT achieves 80.22% accuracy on Mini-ImageNet.

02

Gated-SwinRMT-Retention achieves 78.20% accuracy on Mini-ImageNet.

03

The models show significant accuracy gains over the RMT baseline.

Abstract

We introduce Gated-SwinRMT, a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT), augmented by input-dependent gating. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases. Two variants are proposed. Gated-SwinRMT-SWAT substitutes softmax with sigmoid activation, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the normalized output implicitly suppresses uninformative attention scores. Gated-SwinRMT-Retention retains softmax-normalized retention with an additive log-space decay bias and incorporates an…
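To make the decay mechanism in the abstract concrete, here is a minimal sketch in plain Python of a Manhattan-distance decay mask over one window and of the two ways the abstract says it is applied: multiplied in after a sigmoid activation, or added as a log-space bias before softmax. It is an illustration under stated assumptions, not the authors' implementation. The function names, the single scalar decay rate `gamma`, and the row-major token layout are all hypothetical; the ALiBi slope schedule, SwiGLU gating, output normalization, and window shifting are left out. The sketch also shows why the width-wise and height-wise decomposition is natural: since gamma^(|dy| + |dx|) = gamma^|dy| * gamma^|dx|, the 2D mask factorizes into a row mask and a column mask.

```python
import math


def decay_1d(length, gamma):
    """1D decay mask: entry [i][j] is gamma ** |i - j|."""
    return [[gamma ** abs(i - j) for j in range(length)] for i in range(length)]


def manhattan_decay_mask(height, width, gamma):
    """2D decay mask over a height x width window, flattened row-major.

    Entry [n][m] is gamma ** (|dy| + |dx|) between tokens n and m.
    `gamma` stands in for a per-head decay rate (an assumption of this sketch).
    """
    coords = [(y, x) for y in range(height) for x in range(width)]
    return [[gamma ** (abs(yn - ym) + abs(xn - xm)) for (ym, xm) in coords]
            for (yn, xn) in coords]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multiplicative_decay_weights(scores, mask):
    """Sigmoid activation, then elementwise (post-activation) decay."""
    return [[sigmoid(s) * d for s, d in zip(s_row, d_row)]
            for s_row, d_row in zip(scores, mask)]


def additive_log_decay_weights(scores, mask):
    """Softmax over scores + log(decay): decay enters as a log-space bias."""
    return [softmax([s + math.log(d) for s, d in zip(s_row, d_row)])
            for s_row, d_row in zip(scores, mask)]


# Toy 2 x 3 window with all-zero attention scores, so only the decay matters.
window_h, window_w, gamma = 2, 3, 0.5
mask = manhattan_decay_mask(window_h, window_w, gamma)
scores = [[0.0] * (window_h * window_w) for _ in range(window_h * window_w)]
post = multiplicative_decay_weights(scores, mask)
bias = additive_log_decay_weights(scores, mask)

# Separability: the 2D mask is the product of a row mask and a column mask.
rows, cols = decay_1d(window_h, gamma), decay_1d(window_w, gamma)
separable = all(
    mask[n][m] == rows[n // window_w][m // window_w] * cols[n % window_w][m % window_w]
    for n in range(window_h * window_w) for m in range(window_h * window_w)
)
```

In this toy setting the multiplicative form leaves weights unnormalized (each row of `post` is the mask scaled by sigmoid(0) = 0.5), while the additive form yields rows of `bias` that sum to one, with distant tokens down-weighted in proportion to the mask.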
