QKVA grid: Attention in Image Perspective and Stacked DETR

Wenyuan Sheng

arXiv:2207.04313 · cs.CV · August 17, 2022

TL;DR

This paper introduces SDETR, a variant of DETR that pairs a stacked architecture with a single-head encoder layer motivated by the QKVA grid, a new way of describing attention. It improves on DETR's accuracy, especially on small objects, while lowering the cost of training.

Contribution

The paper proposes the QKVA grid for a new perspective on attention and introduces a stacked architecture to enhance DETR's performance and training efficiency.

Findings

01. SDETR achieves a +0.6 AP improvement over DETR.

02. On small objects, SDETR outperforms the optimized Faster R-CNN baseline, an area where DETR fell short.

03. The QKVA grid clarifies how attention works in image tasks.

Abstract

We present a new model named Stacked-DETR (SDETR), which inherits the main ideas of the canonical DETR. We improve DETR in two directions: lowering the cost of training and introducing a stacked architecture to enhance performance. For the former, we look inside the Attention block and propose the QKVA grid, a new perspective for describing the attention process. With it, we can examine more closely how Attention works for image problems and what effect multi-head has. These two ideas contribute to the design of a single-head encoder layer. For the latter, SDETR achieves better performance than DETR (+0.6 AP, +2.7 APs). In particular, on small objects, previously a shortcoming of DETR, SDETR achieves better results than the optimized Faster R-CNN baseline. Our changes are based on the code of DETR. Training code and pretrained models are available at…
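
As a rough illustration of the tensors the abstract refers to, the following sketch computes standard single-head scaled dot-product attention over a flattened image feature map. It is not the paper's QKVA grid or the SDETR encoder layer, neither of which the abstract specifies; the function name `single_head_attention`, the feature-map size, and the random projection weights are all assumptions made for the example.

```python
import numpy as np


def single_head_attention(x, w_q, w_k, w_v):
    """Standard scaled dot-product attention with a single head.

    x: (n, d) array of n = H * W flattened image positions.
    Returns the attended output (n, d_v) and the attention matrix A (n, n).
    """
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    a = np.exp(scores)
    a /= a.sum(axis=-1, keepdims=True)  # row-wise softmax
    return a @ v, a


# Hypothetical sizes: a 4 x 6 feature map with 8 channels.
rng = np.random.default_rng(0)
h, w, d = 4, 6, 8
features = rng.standard_normal((h, w, d))
x = features.reshape(h * w, d)

# Random projections stand in for learned weights (illustration only).
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, attn = single_head_attention(x, w_q, w_k, w_v)

# One row of A, reshaped to (H, W), shows where position 0 attends.
attn_image = attn[0].reshape(h, w)
```

Reshaping a row of the attention matrix back to the feature map's height and width gives a per-position attention image, which is one common way to inspect attention in image models.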

Code & Models

Repositories

shengwenyuan/sdetr (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Layer Normalization