Sliced Recursive Transformer
Zhiqiang Shen, Zechun Liu, Eric Xing

TL;DR
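The paper introduces the Sliced Recursive Transformer (SReT), a parameter-efficient vision transformer that shares weights across depth to gain about 2% accuracy on ImageNet-1K without adding parameters, uses sliced group self-attention to cut the extra computation that recursion introduces by 10-30%, and scales to very deep models with minimal overhead.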
Contribution
It proposes a novel weight-sharing recursive structure with sliced group self-attention, improving the efficiency and scalability of vision transformers without adding parameters.
Findings
Achieves ~2% accuracy gain on ImageNet-1K with recursive weight sharing.
Reduces computational cost by 10-30% using sliced group self-attentions.
Enables construction of very deep transformers with over 100 shared layers.
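The weight-sharing idea behind these findings can be illustrated with a toy sketch. This is not the authors' implementation: the "block" below is only a residual linear map on a small vector, the helper names (make_block, apply_block, count_params, forward) are hypothetical, and the assumption that each block is simply applied twice in a row is a simplification made for illustration. The point it shows is that reusing a block increases effective depth while the parameter count stays fixed.

```python
# Toy illustration of recursive weight sharing (assumption: not the SReT code).
# A "block" is a residual linear map; all names here are hypothetical.

def make_block(dim, scale=0.1):
    """Create one block's parameters: a dim x dim weight matrix (scale * identity)."""
    return [[scale if i == j else 0.0 for j in range(dim)] for i in range(dim)]

def apply_block(weights, x):
    """Residual update: x + W x."""
    return [xi + sum(w * xj for w, xj in zip(row, x)) for xi, row in zip(x, weights)]

def count_params(blocks):
    """Count parameters over distinct block objects only, so shared blocks count once."""
    unique = {id(b): b for b in blocks}
    return sum(len(row) for b in unique.values() for row in b)

def forward(blocks, x):
    """Apply the blocks in order; a block listed twice is applied twice."""
    for b in blocks:
        x = apply_block(b, x)
    return x

dim = 4
base = [make_block(dim) for _ in range(3)]        # 3 distinct blocks
recursive = [b for b in base for _ in range(2)]   # same 3 blocks, each applied twice

plain_params = count_params(base)
shared_params = count_params(recursive)
plain_out = forward(base, [1.0] * dim)
recursive_out = forward(recursive, [1.0] * dim)
```

Here the recursive stack runs six block applications but holds the same 48 parameters as the three-block stack, which is the sense in which recursion adds depth "without extra parameters". It also runs twice as many block applications, which is the extra computation that the sliced group self-attention is meant to offset.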
Abstract
We present a neat yet effective recursive operation on vision transformers that can improve parameter utilization without involving additional parameters. This is achieved by sharing weights across the depth of transformer networks. The proposed method can obtain a substantial gain (~2%) simply using naive recursive operation, requires no special or sophisticated knowledge for designing principles of networks, and introduces minimal computational overhead to the training procedure. To reduce the additional computation caused by recursive operation while maintaining the superior accuracy, we propose an approximating method through multiple sliced group self-attentions across recursive layers which can reduce the cost consumption by 10~30% with minimal performance loss. We call our model Sliced Recursive Transformer (SReT), a novel and parameter-efficient vision transformer design that is…
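To see why restricting self-attention to groups of tokens lowers cost, a rough operation count helps. The sketch below is an assumption-laden estimate, not the paper's cost model: it counts only the multiply-adds of the two quadratic steps (query-key scores and the attention-weighted sum of values), assumes equal-size groups, and uses a hypothetical function name and hypothetical sizes (196 tokens, width 384).

```python
def attention_madds(num_tokens, dim, groups=1):
    """Multiply-adds for the score matrix and the weighted value sum,
    with tokens split into equal-size groups that attend only within themselves."""
    if num_tokens % groups != 0:
        raise ValueError("num_tokens must be divisible by groups")
    tokens_per_group = num_tokens // groups
    # Each group costs t*t*dim for the scores plus t*t*dim for the weighted sum.
    per_group = 2 * tokens_per_group * tokens_per_group * dim
    return groups * per_group

n, d = 196, 384                       # hypothetical sizes, chosen for illustration
full = attention_madds(n, d)          # one group: ordinary global self-attention
sliced = attention_madds(n, d, groups=2)
ratio = sliced / full
```

With G groups the quadratic term shrinks by a factor of G. The linear projections and feed-forward layers are unaffected by grouping, so the saving for a whole network is much smaller than 1/G; the abstract's 10~30% figure refers to that overall cost and is not reproduced by this estimate.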
Taxonomy
Topics: Visual Attention and Saliency Detection · Image Enhancement Techniques · CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Linear Layer · Vision Transformer · Multi-Head Attention · Dropout · Layer Normalization · Residual Connection · Dense Connections · Softmax · Absolute Position Encodings
