HRFormer: High-Resolution Transformer for Dense Prediction

Yuhui Yuan; Rao Fu; Lang Huang; Weihong Lin; Chao Zhang; Xilin Chen; Jingdong Wang

arXiv:2110.09408·cs.CV·November 9, 2021·126 cites

TL;DR

HRFormer is a transformer architecture that maintains high-resolution representations for dense prediction tasks. On COCO pose estimation it outperforms the Swin transformer by 1.3 AP while using about 50% fewer parameters and 30% fewer FLOPs.

Contribution

HRFormer combines the multi-resolution parallel design of HRNet with local-window self-attention and a convolution inside the feed-forward network (FFN) to improve efficiency and accuracy in dense prediction tasks.
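The efficiency argument for local-window self-attention can be illustrated with a rough operation count. The sketch below is an assumption-laden estimate, not the paper's FLOP accounting: it counts only the multiply-adds for the attention scores and the weighted sum over values, ignores projections and softmax, and uses hypothetical feature-map sizes chosen for illustration.

```python
def attention_madds(num_tokens: int, channels: int) -> int:
    """Multiply-adds for QK^T plus the attention-weighted sum over V."""
    # QK^T costs num_tokens^2 * channels; applying the weights to V costs the same.
    return 2 * num_tokens * num_tokens * channels


def global_attention_madds(height: int, width: int, channels: int) -> int:
    """Every position attends to every other position on the feature map."""
    return attention_madds(height * width, channels)


def window_attention_madds(height: int, width: int, channels: int, window: int) -> int:
    """Attention is computed independently inside each non-overlapping window."""
    # Simplifying assumption: the feature map divides evenly into windows.
    assert height % window == 0 and width % window == 0
    num_windows = (height // window) * (width // window)
    return num_windows * attention_madds(window * window, channels)


# Hypothetical sizes for illustration only (not taken from the paper).
H, W, C, K = 56, 56, 32, 7
global_cost = global_attention_madds(H, W, C)
window_cost = window_attention_madds(H, W, C, K)
ratio = global_cost // window_cost
print(f"global={global_cost} window={window_cost} ratio={ratio}")
```

Under these assumptions the saving factor equals the number of windows, `(H * W) / (K * K)`, so it grows with the resolution of the feature map. That is one way to see why windowed attention suits a design that keeps a high-resolution stream.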

Findings

01. Outperforms the Swin transformer by 1.3 AP on COCO pose estimation.

02. Does so with 50% fewer parameters and 30% fewer FLOPs than Swin.

03. Effective for both human pose estimation and semantic segmentation.

Abstract

We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations and has high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet), along with local-window self-attention that performs self-attention over small non-overlapping image windows, for improving the memory and computation efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks, e.g., HRFormer outperforms Swin transformer by $1.3$ AP on COCO pose estimation with $50\%$ fewer parameters…
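To make the role of the convolution in the FFN concrete, here is a minimal sketch of why non-overlapping windows are "disconnected" and how a small convolution reconnects them. It assumes a 3×3 kernel, an 8×8 feature map, and 4×4 windows. These values and the helper names `window_index` and `windows_touched_by_3x3` are hypothetical, chosen for illustration, and the code only tracks which windows a kernel footprint covers rather than computing any features.

```python
def window_index(row: int, col: int, window: int) -> tuple:
    """Return the (window_row, window_col) of the window containing a position."""
    return (row // window, col // window)


def windows_touched_by_3x3(row: int, col: int, height: int, width: int, window: int) -> set:
    """Collect the windows covered by a 3x3 kernel centred at (row, col)."""
    touched = set()
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            # Skip kernel taps that fall outside the feature map.
            if 0 <= r < height and 0 <= c < width:
                touched.add(window_index(r, c, window))
    return touched


# Hypothetical sizes for illustration only.
H, W, K = 8, 8, 4

interior = windows_touched_by_3x3(1, 1, H, W, K)  # deep inside one window
edge = windows_touched_by_3x3(1, 3, H, W, K)      # on a vertical window border
corner = windows_touched_by_3x3(3, 3, H, W, K)    # where four windows meet

print(len(interior), len(edge), len(corner))
```

Window attention alone never mixes positions that have different `window_index` values. A convolution whose footprint straddles a border, as in the `edge` and `corner` cases, reads features from neighbouring windows, which is the information exchange the abstract describes.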

Code & Models

Repositories

HRNet/HRFormer
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Stochastic Depth · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Residual Connection · Adam · Label Smoothing