DPT: Deformable Patch-based Transformer for Visual Recognition

Zhiyang Chen; Yousong Zhu; Chaoyang Zhao; Guosheng Hu; Wei Zeng; Jinqiao Wang; Ming Tang

arXiv:2107.14467·cs.CV·August 2, 2021

TL;DR

This paper introduces Deformable Patch (DePatch), a module that adaptively splits images into patches with varying positions and scales, improving semantic preservation in vision transformers for classification and detection tasks.

Contribution

We propose the DePatch module that enables adaptive patch splitting, enhancing semantic retention and compatibility with various transformers for visual recognition.

Findings

1. Achieved 81.9% top-1 accuracy on ImageNet classification.

2. Attained 43.7% box mAP with RetinaNet on MSCOCO object detection.

3. Achieved 44.3% box mAP with Mask R-CNN on MSCOCO object detection.

Abstract

Transformers have achieved great success in computer vision, but how to split an image into patches remains an open problem. Existing methods usually use a fixed-size patch embedding, which might destroy the semantics of objects. To address this problem, we propose a new Deformable Patch (DePatch) module, which learns to adaptively split images into patches with different positions and scales in a data-driven way rather than using predefined fixed patches. In this way, our method can better preserve the semantics within patches. DePatch is a plug-and-play module that can easily be incorporated into different transformers and trained end to end. We term this DePatch-embedded transformer the Deformable Patch-based Transformer (DPT) and conduct extensive evaluations of DPT on image classification and object detection. Results show DPT can achieve 81.9% top-1 accuracy on…
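To make the idea of patches with "different positions and scales" concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes each patch is described by a reference center plus an offset and a box size, and that the patch is read out by bilinearly sampling a small k x k grid inside that box. The function names (`bilinear`, `sample_patch`), the parameters (`dx`, `dy`, `pw`, `ph`, `k`), and the sampling scheme are all illustrative assumptions; in a DePatch-style module the offset and size would be predicted by a learned layer rather than passed in by hand.

```python
import math


def bilinear(img, x, y):
    """Bilinearly interpolate a 2D list `img` at continuous (x, y).

    Coordinates are in pixel-index units (x = column, y = row) and are
    clamped to the image border.
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def sample_patch(img, cx, cy, dx, dy, pw, ph, k=2):
    """Sample a k x k grid from a box of size (pw, ph) centered at (cx + dx, cy + dy).

    (cx, cy) is the fixed reference center; (dx, dy) and (pw, ph) stand in
    for the offset and scale that a learned module would predict (assumption).
    """
    x_left = cx + dx - pw / 2.0
    y_top = cy + dy - ph / 2.0
    patch = []
    for i in range(k):
        row = []
        for j in range(k):
            # Sample at the center of each of the k x k cells of the box.
            x = x_left + (j + 0.5) * pw / k
            y = y_top + (i + 0.5) * ph / k
            row.append(bilinear(img, x, y))
        patch.append(row)
    return patch


# Toy 4 x 4 "image" whose value at (row r, column c) is 4 * r + c.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# A fixed patch: zero offset, square 2 x 2 box around the reference center.
fixed = sample_patch(image, cx=1.5, cy=1.5, dx=0.0, dy=0.0, pw=2.0, ph=2.0)

# A deformed patch: shifted right by half a pixel, wider and flatter.
deformed = sample_patch(image, cx=1.5, cy=1.5, dx=0.5, dy=0.0, pw=3.0, ph=1.0)
```

Because both patches are resampled to the same k x k grid, they produce inputs of identical shape for the downstream embedding layer, which is one plausible reason a module of this kind can be dropped into an existing transformer without changing the rest of the network.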

Code & Models

Repositories

CASIA-IVA-Lab/DPT
PyTorch · Official

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Region Proposal Network · Feature Pyramid Network · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Convolution · Focal Loss