ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation

Nanjun Li; Pinqi Cheng; Zean Liu; Minghe Tian; Xuanyin Wang

arXiv:2603.08681 · cs.CV · March 10, 2026

TL;DR

ER-Pose is a keypoint-driven framework for real-time multi-person pose estimation that removes bounding-box constraints from training, improving both accuracy and efficiency over box-driven methods.

Contribution

The paper proposes a keypoint-driven learning paradigm comprising a new prediction head, dynamic sample assignment, and a smooth OKS (Object Keypoint Similarity) loss, aimed at improving the accuracy and efficiency of real-time pose estimation.
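
For readers unfamiliar with OKS, the following is a minimal sketch of the standard COCO-style Object Keypoint Similarity for a single person, together with a plain `1 - OKS` loss. This summary does not define the paper's smooth OKS loss, so the code below is not the authors' formulation. The function names `oks` and `oks_loss`, the argument layout, and the sigma values in the usage lines are illustrative assumptions.

```python
import math


def oks(pred, gt, visible, sigmas, area):
    """Standard COCO-style OKS for one person (not the paper's smooth variant).

    pred, gt : lists of (x, y) keypoint coordinates
    visible  : list of bools, True where the ground-truth keypoint is labeled
    sigmas   : per-keypoint falloff constants (caller-supplied assumption)
    area     : ground-truth object area in pixels, used as the scale term
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if not v:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2   # squared pixel distance
        k2 = (2.0 * sigma) ** 2                # COCO uses k = 2 * sigma
        total += math.exp(-d2 / (2.0 * area * k2))
        count += 1
    return total / count if count else 0.0


def oks_loss(pred, gt, visible, sigmas, area):
    # Plain 1 - OKS; a hypothetical baseline, not the loss proposed in the paper.
    return 1.0 - oks(pred, gt, visible, sigmas, area)


# Toy usage with three keypoints and illustrative sigmas.
gt = [(10.0, 10.0), (20.0, 15.0), (30.0, 40.0)]
shifted = [(x + 3.0, y) for x, y in gt]
vis = [True, True, True]
sig = [0.05, 0.05, 0.05]
print(oks(gt, gt, vis, sig, 900.0))            # identical keypoints give 1.0
print(oks_loss(shifted, gt, vis, sig, 900.0))  # grows as predictions drift
```

Because OKS normalizes distance by object area and a per-keypoint constant, the same pixel error is penalized more on small people than on large ones, which is why it is commonly used as both a metric and a training signal for pose estimation.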

Findings

01. Achieves 3.2/6.7 AP improvement without pre-training on MS COCO and CrowdPose.

02. Achieves 7.4/4.9 AP improvement with pre-training.

03. Fewer parameters and higher inference speed than baseline models.

Abstract

Single-stage multi-person pose estimation aims to jointly perform human localization and keypoint prediction within a unified framework, offering advantages in inference efficiency and architectural simplicity. Consequently, multi-scale real-time detection architectures, such as YOLO-like models, are widely adopted for real-time pose estimation. However, these approaches typically inherit a box-driven modeling paradigm from object detection, in which pose estimation is implicitly constrained by bounding-box supervision during training. This formulation introduces biases in sample assignment and feature representation, resulting in task misalignment and ultimately limiting pose estimation accuracy. In this work, we revisit box-driven single-stage pose estimation from a keypoint-driven perspective and identify semantic conflicts among parallel objectives as a key source of performance…

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation