SimPLR: A Simple and Plain Transformer for Efficient Object Detection and Segmentation
Duy-Kien Nguyen, Martin R. Oswald, Cees G. M. Snoek

TL;DR
SimPLR introduces a straightforward, non-hierarchical transformer architecture that effectively incorporates multi-scale information into attention mechanisms, achieving competitive performance in object detection and segmentation tasks with improved scalability and efficiency.
Contribution
The paper proposes a simple, plain transformer model that embeds multi-scale inductive bias into attention, eliminating the need for complex pyramid structures while maintaining high accuracy.
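To make the idea of embedding a multi-scale bias into attention while staying on single-scale features more concrete, here is a minimal NumPy sketch. It is an illustrative toy, not the paper's scale-aware attention: the function names (`avg_pool2d`, `toy_scale_aware_attention`), the one-scale-per-head assignment, the average pooling of keys and values, and the omission of learned projections are all assumptions made for this example.

```python
import numpy as np

def avg_pool2d(x, k):
    """Average-pool an (H, W, C) map with a k x k window and stride k."""
    H, W, C = x.shape
    assert H % k == 0 and W % k == 0, "H and W must be divisible by k"
    return x.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def toy_scale_aware_attention(feat, scales=(1, 2, 4)):
    """Hypothetical sketch: one attention head per scale.

    Queries come from the full-resolution, single-scale map. Each head's
    keys/values come from the same map pooled at that head's scale.
    Learned Q/K/V projections are omitted for brevity (an assumption).
    """
    H, W, C = feat.shape
    n_heads = len(scales)
    assert C % n_heads == 0, "channels must split evenly across heads"
    d = C // n_heads
    q = feat.reshape(H * W, n_heads, d)
    out = np.empty_like(q)
    for h, k in enumerate(scales):
        # Keys/values for head h: the map pooled at scale k, head-h channels.
        kv = avg_pool2d(feat, k).reshape(-1, n_heads, d)[:, h]
        attn = softmax(q[:, h] @ kv.T / np.sqrt(d))  # (H*W, number of pooled tokens)
        out[:, h] = attn @ kv
    # The output has the same single-scale shape as the input.
    return out.reshape(H, W, C)
```

Calling `toy_scale_aware_attention(np.random.rand(8, 8, 12))` returns an array of shape `(8, 8, 12)`: the scale variation lives inside the attention heads, so no feature pyramid is built and the output stays at a single resolution.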
Findings
SimPLR achieves accuracy competitive with multi-scale vision transformer alternatives.
Compared with multi-scale and single-scale state-of-the-art models, it scales better with larger-capacity (self-supervised) models and more data.
It offers faster runtime for detection and segmentation tasks.
Abstract
The ability to detect objects in images at varying scales has played a pivotal role in the design of modern object detectors. Despite considerable progress in removing hand-crafted components and simplifying the architecture with transformers, multi-scale feature maps and pyramid designs remain a key factor for their empirical success. In this paper, we show that shifting the multi-scale inductive bias into the attention mechanism can work well, resulting in a plain detector 'SimPLR' whose backbone and detection head are both non-hierarchical and operate on single-scale features. We find through our experiments that SimPLR with scale-aware attention is a plain and simple architecture, yet competitive with multi-scale vision transformer alternatives. Compared to the multi-scale and single-scale state-of-the-art, our model scales better with bigger capacity (self-supervised) models and more…
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dense Connections · Vision Transformer · Label Smoothing · Adam · Absolute Position Encodings
