DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation

Shuaitao Zhao; Kun Liu; Yuhang Huang; Qian Bao; Dan Zeng; and Wu Liu

arXiv:2209.02431 · cs.CV · September 7, 2022

TL;DR

DPIT introduces a dual-pipeline transformer approach that combines top-down and bottom-up methods for improved human pose estimation, effectively handling occlusion and scale variation by fusing global and local visual cues.

Contribution

This work is one of the first to integrate top-down and bottom-up pipelines with transformers for human pose estimation, with the aim of improving accuracy and robustness.

Findings

01 Achieves performance comparable to state-of-the-art methods on the COCO and MPII datasets.

02 Addresses occlusion, which degrades top-down methods, and scale variation, which degrades bottom-up methods.

03 Demonstrates the benefit of fusing global (full-image) and local (single-person) features via transformers.

Abstract

Human pose estimation aims to localize the keypoints of all people in a scene. Despite promising results, current approaches still face challenges. Existing top-down methods process each person individually, ignoring the interactions between different people and the scene they are in. Consequently, their performance degrades under severe occlusion. Existing bottom-up methods, on the other hand, consider all people at once and capture global knowledge of the entire image. However, they are less accurate than top-down methods because of scale variation. To address these problems, we propose a novel Dual-Pipeline Integrated Transformer (DPIT) that integrates the top-down and bottom-up pipelines to explore visual clues from different receptive fields and exploit their complementarity. Specifically, DPIT consists of two…

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Hand Gesture Recognition Systems

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Absolute Position Encodings · Residual Connection · Position-Wise Feed-Forward Layer · Adam · Dense Connections