Pattern Attention Transformer with Doughnut Kernel

WenYuan Sheng

arXiv:2211.16961 · cs.CV · September 19, 2023

TL;DR

The paper introduces the Pattern Attention Transformer (PAT) with a novel doughnut kernel that improves image classification efficiency and accuracy by enhancing patch design and kernel shape, outperforming Swin Transformer on ImageNet 1K.

Contribution

It proposes a new doughnut kernel for Transformers, enabling more efficient patch processing and higher performance in image classification tasks.

Findings

01 Higher throughput (+10%) on ImageNet 1K

02 Surpasses Swin Transformer by +0.8 in accuracy

03 Uses a lighter architecture with only one pattern attention layer per stage

Abstract

We present a new architecture, the Pattern Attention Transformer (PAT), which is built from the new doughnut kernel. Compared with tokens in NLP, a Transformer in computer vision has to handle the high resolution of pixels in images. In ViT, an image is cut into square patches. As a follow-up to ViT, Swin Transformer adds a shifting step to reduce the effect of fixed boundaries, which also makes 'two connected Swin Transformer blocks' the minimum unit of the model. Inheriting the patch/window idea, our doughnut kernel takes the design of patches further. It replaces the line-cut boundaries with two types of areas, sensor and updating, based on an interpretation of self-attention (named the QKVA grid). The doughnut kernel also raises a new question about kernel shapes beyond the square. To verify its…
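The patch ideas mentioned in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation. `square_patches` shows the ViT-style cut into square patches, and `cyclic_shift` shows the kind of roll that Swin-style shifted windows rely on. `ring_masks` is a purely hypothetical stand-in for a doughnut-shaped layout: the function name, the circular geometry, the radii, and the assignment of the inner disc to "updating" and the outer ring to "sensor" are all assumptions made for illustration, since the abstract here does not specify them.

```python
import numpy as np


def square_patches(image, patch):
    """Cut an (H, W, C) image into non-overlapping patch x patch squares (ViT style)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # Split both spatial axes into (block index, offset inside block).
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    # Bring the two block indices together, then flatten them into one patch axis.
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)


def cyclic_shift(image, shift):
    """Roll the image up and left by `shift` pixels, as in shifted-window schemes."""
    return np.roll(image, shift=(-shift, -shift), axis=(0, 1))


def ring_masks(size, inner_radius, outer_radius):
    """Hypothetical doughnut layout on a size x size grid (assumed geometry).

    Returns two boolean masks: a central 'updating' disc and a surrounding
    'sensor' ring. Which area plays which role is an assumption here.
    """
    ys, xs = np.mgrid[:size, :size]
    centre = (size - 1) / 2
    dist = np.hypot(ys - centre, xs - centre)
    updating = dist <= inner_radius
    sensor = (dist > inner_radius) & (dist <= outer_radius)
    return updating, sensor


if __name__ == "__main__":
    image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
    patches = square_patches(image, 4)
    print(patches.shape)  # four 4x4 patches with 3 channels each
    updating, sensor = ring_masks(9, inner_radius=2, outer_radius=4)
    print(updating.sum(), sensor.sum())
```

The square cut leaves hard boundaries between neighbouring patches, which is the issue the shifting step addresses. The two masks above only illustrate how a kernel could separate an area that is updated from an area that supplies context instead of using straight cut lines.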

Taxonomy

Topics: Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques · CCD and CMOS Imaging Sensors

Methods: Multi-Head Attention · Attention Is All You Need · Stochastic Depth · Layer Normalization · Adam · Absolute Position Encodings · Softmax · Dropout · Byte Pair Encoding · Swin Transformer