PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation

Mengyuan Liu; Jiajie Liu; Jinyan Zhang; Wenhao Li; Junsong Yuan

arXiv:2512.16494·cs.CV·December 19, 2025

TL;DR

PoseMoE introduces a mixture-of-experts network that disentangles 2D pose and depth features, improving monocular 3D human pose estimation accuracy by reducing the influence of uncertain depth features.

Contribution

The paper proposes PoseMoE, a novel mixture-of-experts architecture that separately refines 2D pose and depth features, with a cross-expert module for better feature aggregation, addressing limitations of previous entangled encoding methods.
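The idea of separate experts with a cross-expert exchange can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`toy_disentangled_lift`, `cross_attend`), the feature width, the random untrained weights, the single-head attention used as a stand-in for the cross-expert module, and the restriction to a single frame (no temporal axis) are all assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Randomly initialised (weight, bias) pair for a toy linear layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.1, np.zeros(out_dim)

def apply(params, x):
    """Apply a toy linear layer to x."""
    w, b = params
    return x @ w + b

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feat, context_feat):
    """Single-head attention: each query joint attends to all context joints.
    Hypothetical stand-in for a cross-expert aggregation step."""
    d = query_feat.shape[-1]
    scores = query_feat @ context_feat.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ context_feat

def toy_disentangled_lift(pose_2d, dim=16):
    """pose_2d: (J, 2) detected joints. Returns a (J, 3) toy 3D pose."""
    # Separate parameters per stream, so the two feature spaces stay disentangled.
    embed_2d, embed_depth = linear(2, dim), linear(2, dim)
    expert_2d, expert_depth = linear(dim, dim), linear(dim, dim)
    head_xy, head_z = linear(dim, 2), linear(dim, 1)

    # Each expert refines only its own stream.
    feat_2d = np.tanh(apply(expert_2d, apply(embed_2d, pose_2d)))
    feat_depth = np.tanh(apply(expert_depth, apply(embed_depth, pose_2d)))

    # Features are exchanged only after each stream has been refined separately.
    fused_2d = feat_2d + cross_attend(feat_2d, feat_depth)
    fused_depth = feat_depth + cross_attend(feat_depth, feat_2d)

    xy = pose_2d + apply(head_xy, fused_2d)  # residual correction of the detected 2D pose
    z = apply(head_z, fused_depth)           # depth is predicted from the depth stream
    return np.concatenate([xy, z], axis=-1)

pose_2d = rng.standard_normal((17, 2))       # 17 joints, as in Human3.6M skeletons
pose_3d = toy_disentangled_lift(pose_2d)
print(pose_3d.shape)
```

The point of the sketch is the ordering: the depth stream is first processed by its own expert, and only then mixed with the 2D stream, rather than encoding the detected 2D pose and the unknown depth in one shared feature space from the start.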

Findings

1. Outperforms existing lifting-based methods on the Human3.6M, MPI-INF-3DHP, and 3DPW datasets.

2. Disentangles 2D pose and depth features, reducing the impact of depth uncertainty.

3. Enhances feature representation through cross-expert spatio-temporal aggregation.

Abstract

Lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. Lifting-based methods encode the detected 2D pose and the unknown depth in an entangled feature space, which explicitly introduces depth uncertainty into the detected 2D pose and thereby limits overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information…

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation