ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation
Nanjun Li, Pinqi Cheng, Zean Liu, Minghe Tian, Xuanyin Wang

TL;DR
ER-Pose introduces a keypoint-driven framework for real-time multi-person pose estimation that removes bounding-box constraints, improving both accuracy and efficiency over traditional box-driven methods.
Contribution
The paper proposes a novel keypoint-driven learning paradigm, including a new prediction head, dynamic sample assignment, and a smooth OKS loss, to enhance real-time pose estimation accuracy and efficiency.
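For readers unfamiliar with the metric behind the loss, the following is a minimal sketch of the standard Object Keypoint Similarity (OKS) as defined for COCO keypoint evaluation. It is not the paper's smooth OKS loss, whose exact form is not given in this summary. The function name `oks` and its argument layout are illustrative assumptions; the sigma values are the per-keypoint constants published with the COCO evaluation code.

```python
import math

# Per-keypoint falloff constants from the COCO keypoint evaluation
# (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles).
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072,
               0.072, 0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]


def oks(pred, gt, visibility, area, sigmas=COCO_SIGMAS):
    """Standard OKS between one predicted and one ground-truth pose.

    pred, gt   : sequences of (x, y) keypoint coordinates
    visibility : per-keypoint flags; only keypoints with flag > 0 count
    area       : object scale squared (the ground-truth segment area in COCO)
    sigmas     : per-keypoint falloff constants
    (Hypothetical helper for illustration, not taken from the paper.)
    """
    eps = 1e-12  # guards against division by zero for degenerate areas
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma  # COCO uses k_i = 2 * sigma_i
        total += math.exp(-d2 / (2.0 * area * k * k + eps))
        count += 1
    return total / count if count else 0.0
```

A common way to turn this into a training objective in box-driven pose heads is `1 - oks(...)`; how ER-Pose smooths it is an open detail here and should be taken from the paper itself.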
Findings
Improves AP by 3.2/6.7 on MS COCO/CrowdPose without pre-training.
Improves AP by 7.4/4.9 on MS COCO/CrowdPose with pre-training.
Uses fewer parameters and runs faster at inference than the baseline models.
Abstract
Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance…
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation
