HRFormer: High-Resolution Transformer for Dense Prediction
Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, Jingdong Wang

TL;DR
HRFormer is a high-resolution transformer architecture that learns detailed representations for dense prediction tasks. It outperforms the Swin transformer on pose estimation and is effective on semantic segmentation, while using fewer parameters and FLOPs.
Contribution
It combines the multi-resolution parallel design of HRNet with local-window self-attention and a convolution inside the feed-forward network (FFN) to improve efficiency and accuracy in dense prediction tasks.
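The efficiency argument for local-window self-attention can be illustrated with a rough count of query-key pairs. The sketch below is only an illustration, not the authors' code: the function names, the 56x56 feature map, and the 7x7 window are assumptions chosen for the example, and the count ignores heads, channels, and projections.

```python
def global_attention_pairs(height, width):
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def window_attention_pairs(height, width, window):
    """Query-key pairs when attention is restricted to non-overlapping
    window x window blocks (assumes the map divides evenly into windows)."""
    if height % window or width % window:
        raise ValueError("feature map must divide evenly into windows")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


if __name__ == "__main__":
    # Hypothetical sizes: a 56x56 high-resolution map, 7x7 windows.
    full = global_attention_pairs(56, 56)
    local = window_attention_pairs(56, 56, 7)
    print(full, local, full // local)
```

Under these assumed sizes the windowed variant evaluates 64 times fewer pairs, and the gap grows with resolution because the global cost is quadratic in the number of tokens while the windowed cost is linear.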
Findings
Outperforms Swin transformer by 1.3 AP on COCO pose estimation.
Uses 50% fewer parameters and 30% fewer FLOPs than Swin.
Effective for human pose estimation and semantic segmentation.
Abstract
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations and has high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet), along with local-window self-attention that performs self-attention over small non-overlapping image windows, for improving the memory and computation efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks, e.g., HRFormer outperforms Swin transformer by 1.3 AP on COCO pose estimation with fewer parameters…
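Why a convolution in the FFN reconnects the windows can be seen by comparing which positions each operation reads. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the 3x3 kernel and the 8x8 map with 4x4 windows are illustrative choices, since the abstract only says "a convolution".

```python
def window_id(row, col, window):
    """Index of the non-overlapping window that contains (row, col)."""
    return (row // window, col // window)


def attention_neighbors(row, col, height, width, window):
    """Positions a token can attend to under local-window self-attention."""
    wr, wc = window_id(row, col, window)
    return {
        (r, c)
        for r in range(wr * window, min((wr + 1) * window, height))
        for c in range(wc * window, min((wc + 1) * window, width))
    }


def conv3x3_neighbors(row, col, height, width):
    """In-bounds positions read by a 3x3 convolution centred on (row, col)."""
    return {
        (r, c)
        for r in range(max(row - 1, 0), min(row + 2, height))
        for c in range(max(col - 1, 0), min(col + 2, width))
    }


def conv_crosses_window(row, col, height, width, window):
    """True if the 3x3 convolution at (row, col) reads from another window."""
    home = window_id(row, col, window)
    return any(
        window_id(r, c, window) != home
        for r, c in conv3x3_neighbors(row, col, height, width)
    )


if __name__ == "__main__":
    # Hypothetical 8x8 map split into 4x4 windows.
    print((4, 4) in attention_neighbors(3, 3, 8, 8, 4))  # attention stays inside the window
    print(conv_crosses_window(3, 3, 8, 8, 4))            # the convolution reaches across it
```

In this toy setting, attention at a window's border never sees the adjacent window, while the convolution at the same position does, which is the information exchange the abstract describes.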
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Stochastic Depth · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Residual Connection · Adam · Label Smoothing
