TL;DR
This paper introduces Deformable Patch (DePatch), a module that adaptively splits images into patches with varying positions and scales, improving semantic preservation in vision transformers for classification and detection tasks.
Contribution
We propose DePatch, a plug-and-play module that adaptively splits images into patches and so better preserves object semantics. It can be incorporated into different transformers for visual recognition.
Findings
DPT achieves 81.9% top-1 accuracy on ImageNet classification.
DPT attains 43.7% box mAP with RetinaNet on MSCOCO object detection.
DPT attains 44.3% box mAP with Mask R-CNN on MSCOCO object detection.
Abstract
Transformers have achieved great success in computer vision, but how to split an image into patches remains an open problem. Existing methods usually use a fixed-size patch embedding, which might destroy the semantics of objects. To address this problem, we propose a new Deformable Patch (DePatch) module, which learns to adaptively split images into patches with different positions and scales in a data-driven way rather than using predefined fixed patches. In this way, our method can better preserve the semantics in patches. DePatch works as a plug-and-play module that can easily be incorporated into different transformers for end-to-end training. We term the DePatch-embedded transformer the Deformable Patch-based Transformer (DPT) and conduct extensive evaluations of DPT on image classification and object detection. Results show DPT can achieve 81.9% top-1 accuracy on…
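The abstract describes patches with learned positions and scales but does not say how their contents are read out. The following is a minimal sketch of one way to do it, assuming each patch is a box given by a centre offset and a width/height, sampled on a k × k grid with bilinear interpolation. The function names `bilinear` and `sample_patch` are hypothetical, and the offset and scale are passed in as plain arguments rather than predicted by a network, so this illustrates the idea and is not the authors' implementation.

```python
import math


def bilinear(image, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at continuous (x, y)."""
    h, w = len(image), len(image[0])
    # Clamp so that samples falling outside the image reuse border values.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def sample_patch(image, cx, cy, dx=0.0, dy=0.0, width=2.0, height=2.0, k=2):
    """Sample a k x k grid of values from a box centred at (cx + dx, cy + dy).

    (dx, dy) and (width, height) stand in for the per-patch offset and scale
    that a learned predictor would output (assumption: plain arguments here).
    """
    left = cx + dx - width / 2.0
    top = cy + dy - height / 2.0
    samples = []
    for i in range(k):
        row = []
        for j in range(k):
            # Sample points sit at the centres of a k x k subdivision of the box.
            x = left + (j + 0.5) * width / k
            y = top + (i + 0.5) * height / k
            row.append(bilinear(image, x, y))
        samples.append(row)
    return samples


# Toy 4 x 4 "image" whose value at (x, y) is 4 * y + x.
image = [[4 * y + x for x in range(4)] for y in range(4)]

fixed = sample_patch(image, cx=1.5, cy=1.5)            # zero offset, default scale
shifted = sample_patch(image, cx=1.5, cy=1.5, dx=0.5)  # same box moved right by 0.5
```

With zero offset and the default size, the box reduces to a fixed patch and `fixed` equals `[[5.0, 6.0], [9.0, 10.0]]`. A non-zero offset or a different width and height moves or resizes the sampled region without changing the number of samples per patch.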
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Region Proposal Network · Feature Pyramid Network · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Convolution · Focal Loss
