Pattern Attention Transformer with Doughnut Kernel
WenYuan Sheng

TL;DR
The paper introduces the Pattern Attention Transformer (PAT), built on a novel doughnut kernel that replaces the line-cut boundaries of square patches with sensor and updating areas. On ImageNet 1K image classification it reports higher throughput (+10%) and higher accuracy (+0.8) than Swin Transformer.
Contribution
It proposes a new doughnut kernel for Transformers, enabling more efficient patch processing and higher performance in image classification tasks.
Findings
Higher throughput (+10%) on ImageNet 1K
Surpasses Swin Transformer by +0.8 top-1 accuracy
Uses lighter architecture with only one pattern attention layer per stage
Abstract
We present a new architecture, the Pattern Attention Transformer (PAT), which is built from the new doughnut kernel. Unlike tokens in NLP, Transformers in computer vision must handle the high pixel resolution of images. In ViT, an image is cut into square patches. As a follow-up to ViT, Swin Transformer adds a shifting step to reduce the effect of fixed boundaries, which also makes 'two connected Swin Transformer blocks' the minimum unit of the model. Inheriting the patch/window idea, our doughnut kernel takes the design of patches further. It replaces the line-cut boundaries with two types of areas, sensor and updating, based on an understanding of self-attention (named the QKVA grid). The doughnut kernel also opens a new topic: kernel shapes beyond the square. To verify its…
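To make the "fixed boundaries" point concrete, here is a minimal sketch of how square window partitioning assigns pixels to windows, and how a Swin-style cyclic shift changes that assignment. It does not implement the doughnut kernel, whose geometry the abstract does not specify. The function name `window_index` and its parameters are hypothetical, chosen only for this illustration, and the sketch ignores the attention mask that Swin Transformer applies to wrapped-around regions.

```python
def window_index(r, c, height, width, window, shift=0):
    """Return the index of the square window that pixel (r, c) falls in.

    Assumption: height and width are divisible by `window`.
    With shift > 0 the grid is cyclically shifted first, as in a
    shifted-window block (attention masking is not modelled here).
    """
    if height % window or width % window:
        raise ValueError("image size must be divisible by window size")
    rr = (r - shift) % height   # row position after the cyclic shift
    cc = (c - shift) % width    # column position after the cyclic shift
    windows_per_row = width // window
    return (rr // window) * windows_per_row + (cc // window)


if __name__ == "__main__":
    H = W = 8
    M = 4  # window size
    # Two horizontally adjacent pixels on either side of a window boundary.
    a, b = (0, 3), (0, 4)
    print("regular:", window_index(*a, H, W, M), window_index(*b, H, W, M))
    print("shifted:", window_index(*a, H, W, M, shift=M // 2),
          window_index(*b, H, W, M, shift=M // 2))
```

In the regular partition the two neighbouring pixels land in different windows, so self-attention within a window never relates them. After shifting by half a window they share one. This is why Swin alternates regular and shifted blocks, and it is the pairing that the abstract says the doughnut kernel avoids.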
Taxonomy
Topics: Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques · CCD and CMOS Imaging Sensors
Methods: Multi-Head Attention · Attention Is All You Need · Stochastic Depth · Layer Normalization · Adam · Absolute Position Encodings · Softmax · Dropout · Byte Pair Encoding · Swin Transformer
