Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating
Dipan Maity, Suman Mondal, Arindam Roy

TL;DR
Gated-SwinRMT introduces a hybrid vision transformer combining shifted-window attention with Manhattan-distance decay, enhanced by input-dependent gating, achieving improved accuracy on image classification benchmarks.
Contribution
The paper unifies Swin Transformer shifted-window attention with Manhattan-distance decay and input-dependent gating, and proposes two variants that outperform baseline models on Mini-ImageNet.
Findings
Gated-SwinRMT-SWAT achieves 80.22% accuracy on Mini-ImageNet.
Gated-SwinRMT-Retention achieves 78.20% accuracy on Mini-ImageNet.
The models show significant accuracy gains over the RMT baseline.
Abstract
We introduce Gated-SwinRMT, a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT), augmented by input-dependent gating. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases. Two variants are proposed. Gated-SwinRMT-SWAT substitutes softmax with sigmoid activation, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the normalized output implicitly suppresses uninformative attention scores. Gated-SwinRMT-Retention retains softmax-normalized retention with an additive log-space decay bias and incorporates an…
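As a rough illustration of the decay mechanism the abstract describes, the sketch below builds a Manhattan-distance decay mask for one window and applies it in the two ways the abstract names: multiplied after a sigmoid activation, and added as a log-space bias before softmax. It is not the authors' implementation. The function names, the single-head layout, the scaling by the square root of the head dimension, and the values of `window` and `gamma` are all assumptions made for the example; ALiBi slopes, SwiGLU gating, and window shifting are omitted.

```python
import numpy as np

def axis_decay(window, gamma):
    """1-D decay along one axis: D[i, j] = gamma ** |i - j|."""
    idx = np.arange(window)
    return gamma ** np.abs(idx[:, None] - idx[None, :])

def manhattan_decay_mask(window, gamma):
    """2-D decay over a window x window grid of tokens (row-major order):
    D[n, m] = gamma ** (|y_n - y_m| + |x_n - x_m|)."""
    ys, xs = np.meshgrid(np.arange(window), np.arange(window), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)            # (N, 2)
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)  # (N, N)
    return gamma ** dist

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_multiplicative(q, k, v, decay):
    """Assumed form: sigmoid scores, then multiplicative spatial decay."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return (sigmoid(scores) * decay) @ v

def softmax_additive(q, k, v, decay):
    """Assumed form: log-space decay added as a bias before softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores + np.log(decay)) @ v

# Hypothetical sizes: a 4x4 window (16 tokens), head dimension 8.
rng = np.random.default_rng(0)
window, dim, gamma = 4, 8, 0.9
q, k, v = (rng.standard_normal((window * window, dim)) for _ in range(3))
decay = manhattan_decay_mask(window, gamma)
out_sigmoid = sigmoid_multiplicative(q, k, v, decay)
out_softmax = softmax_additive(q, k, v, decay)
```

Because the Manhattan distance is a sum of a vertical and a horizontal term, the 2-D mask factorizes as the Kronecker product of two 1-D masks, `np.kron(axis_decay(window, gamma), axis_decay(window, gamma))`. That separability is what makes consecutive height-wise and width-wise passes possible.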
