Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning

Yuquan Bi; Hongsong Wang; Xinli Shi; Zhipeng Gui; Jie Gui; Yuan Yan Tang

arXiv:2508.21363·cs.CV·March 10, 2026

TL;DR

This paper introduces an efficient diffusion-based framework for 3D human pose estimation that employs hierarchical temporal pruning to significantly reduce computational cost while maintaining state-of-the-art accuracy.

Contribution

It proposes a novel hierarchical temporal pruning strategy that dynamically prunes redundant pose tokens at multiple levels, improving efficiency without sacrificing performance.

Findings

01. Reduces training MACs by 38.5%

02. Decreases inference MACs by 56.8%

03. Speeds up inference by 81.1% on benchmark datasets

Abstract

Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs…

Tables

Table 1. TABLE I: Quantitative Comparison with the SOTA Methods on the Human3.6M Dataset. $F$: The Number of Input Frames. CE: Estimating Center Frame Only. Detector: Using CPN [4] and SH [32] as the 2D Keypoints Detector to Generate the Inputs, or Using the Ground Truth 2D Keypoints as Inputs. † Indicates the Scratch Setting in [72], and ‡ Indicates the Finetune Setting in [72]. HTP (Ours) Utilizes the Default D3DP Backbone; Variants like HTP w/ MixSTE Demonstrate Plug-and-play Generalization. The Best and Second-Best Results Are Highlighted in Bold and Underline Formats.

Method	Type	Publication	$F$	CE	Detector (DET)	MPJPE $↓$ (DET)	P-MPJPE $↓$ (DET)	Detector (GT)	MPJPE $↓$ (GT)	MACs (G)	Params (M)
TCN [33]	CNN	CVPR’19	243	–	CPN	46.8	36.5	GT	37.8	–	–
GLA-GCN [58]	GCN	ICCV’23	243	✓	CPN	44.4	34.8	GT	21.0	–	–
FTCM [42]	MLP	TCSVT’24	351	✓	CPN	45.3	35.3	GT	28.2	1.5	4.7
PoseMamba-X [18]	Mamba	AAAI’25	243	✕	CPN	37.1	31.5	GT	–	109.9	26.5
SAMA-L [29]	Mamba	ICCV’25	243	✕	CPN	36.9	31.3	GT	–	53.2	17.3
PoseFormer [69]	Transformer	ICCV’21	81	✓	CPN	44.3	36.5	GT	31.3	1.6	9.6
P-STMO [36]	Transformer	ECCV’22	243	✓	CPN	42.8	34.4	GT	29.3	1.7	7.0
PoseFormerV2 [67]	Transformer	CVPR’23	243	✓	CPN	45.2	35.6	GT	35.5	2.1	14.4
MHFormer [25]	Transformer	CVPR’22	351	✓	CPN	43.0	34.4	GT	30.5	9.6	24.7
MixSTE [65]	Transformer	CVPR’22	243	✕	CPN	40.9	30.6	GT	21.6	139.0	33.8
STCFormer [43]	Transformer	CVPR’23	243	✕	CPN	40.5	31.8	GT	21.3	78.2	18.9
MotionBERT [72]^†	Transformer	ICCV’23	243	✕	SH	39.2	–	GT	17.8	174.8	42.5
MotionBERT [72]^‡	Transformer	ICCV’23	243	✕	SH	37.5	–	GT	16.9	174.8	42.5
HOT [26]	Transformer	CVPR’24	243	✕	CPN	41.0	–	GT	–	83.8	35.0
TC-MixSTE [49]	Transformer	TMM’24	243	✕	CPN	39.9	31.9	GT	21.1	–	–
DualFormer [71]	Transformer	TCSVT’24	351	✕	CPN	42.8	34.4	GT	28.9	–	–
HTP w/ MixSTE	Transformer	–	243	✕	CPN	39.9	29.9	GT	20.7	87.6	36.4
HTP w/ MotionBERT^†	Transformer	–	243	✕	SH	38.9	–	GT	17.7	101.7	47.6
Diffpose [12]	Diffusion	CVPR’23	243	✕	CPN	36.9	28.7	GT	18.9	–	–
Diffpose [15]	Diffusion	ICCV’23	64	✕	–	42.9	30.8	GT	–	–	–
D3DP [37]	Diffusion	ICCV’23	243	✕	CPN	35.4	28.7	GT	18.4	139.1	34.8
FinePOSE [54]	Diffusion	CVPR’24	243	✕	CPN	31.9	25.0	GT	16.7	146	200.6
KTPFormer [34]	Diffusion	CVPR’24	243	✕	CPN	33.0	26.2	GT	18.1	139.1	36.3
HTP (Ours)	Diffusion	–	243	✕	CPN	29.9	23.3	GT	16.7	87.7	37.5

Table 2. TABLE II: Quantitative Comparison with SOTA Methods on the Human3.6M Dataset Under MPJPE for Various Actions. Dir., Disc., $\cdots$, and WalkT. Correspond to 15 Action Classes. Avg Indicates the Average MPJPE Among 15 Action Classes. The Best and Second-Best Results Are Highlighted in Bold and Underline Formats.

Method	Type	MPJPE $↓$
Method	Type	Dir.	Disc.	Eat	Greet	Phone	Photo	Pose	Pur.	Sit	SitD.	Smoke	Wait	WalkD.	Walk	WalkT.	Avg.
TCN [33]	CNN	45.2	46.7	43.3	45.6	48.1	55.1	44.6	44.3	57.3	65.8	47.1	44.0	49.0	32.8	33.9	46.8
DUE [64]	GCN	37.9	41.9	36.8	39.5	40.8	49.2	40.1	40.7	47.9	53.3	40.2	41.1	40.3	30.8	28.6	40.6
GLA-GCN [58]	GCN	41.3	44.3	40.8	41.8	45.9	54.1	42.1	41.5	57.8	62.9	45.0	42.8	45.9	29.4	29.9	44.4
FTCM [42]	MLP	42.2	44.4	42.4	42.4	47.7	55.8	42.7	41.9	58.7	64.5	46.1	44.2	45.2	30.6	31.1	45.3
PoseFormer [69]	Transformer	41.5	44.8	39.8	42.5	46.5	51.6	42.1	42.0	53.3	60.7	45.5	43.3	46.1	31.8	32.2	44.3
GraFormer [68]	Transformer	45.2	50.8	48.0	50.0	54.9	65.0	48.2	47.1	60.2	70.0	51.6	48.7	54.1	39.7	43.1	51.8
P-STMO [36]	Transformer	38.9	42.7	40.4	41.1	45.6	49.7	40.9	39.9	55.5	59.4	44.9	42.2	42.7	29.4	29.4	42.8
MixSTE [65]	Transformer	36.7	39.0	36.5	39.4	40.2	44.9	39.8	36.9	47.9	54.8	39.6	37.8	39.3	29.7	30.6	39.8
MHFormer [25]	Transformer	39.2	43.1	40.1	40.9	44.9	51.2	40.6	41.3	53.5	60.3	43.7	41.1	43.8	29.8	30.6	43.0
STCFormer [43]	Transformer	38.4	41.2	36.8	38.0	42.7	50.5	38.7	38.2	52.5	56.8	41.8	38.4	40.2	26.2	27.7	40.5
MotionBERT [72]^‡	Transformer	36.1	37.5	35.8	32.1	40.3	46.3	36.1	35.3	46.9	53.9	39.5	36.3	35.8	25.1	25.3	37.5
TC-MixSTE [49]	Transformer	36.9	40.9	36.3	39.0	41.6	48.7	38.4	39.3	50.3	54.9	40.6	38.0	40.6	26.5	26.0	39.9
DualFormer [71]	Transformer	38.9	43.1	39.2	41.4	45.1	50.7	41.5	41.2	51.7	60.5	43.4	41.4	43.0	29.9	31.0	42.8
Diffpose [12]	Diffusion	33.2	36.6	33.0	35.6	37.6	45.1	35.7	35.5	46.4	49.9	37.3	35.6	36.5	24.4	24.1	36.9
D3DP [37]	Diffusion	33.0	34.8	31.7	33.1	37.5	43.7	34.8	33.6	45.7	47.8	37.0	35.0	35.0	24.3	24.1	35.4
FinePOSE [54]	Diffusion	31.4	31.5	28.8	29.7	34.3	36.5	29.2	30.0	42.0	42.5	33.3	31.9	31.4	22.6	22.7	31.9
KTPFormer [34]	Diffusion	30.1	32.1	29.1	30.6	35.4	39.3	32.8	30.9	43.1	45.5	34.7	33.2	32.7	22.1	23.0	33.0
HTP (Ours)	Diffusion	28.5	30.0	26.4	27.2	31.5	36.0	28.8	27.7	39.5	39.1	30.7	29.1	30.7	21.7	22.3	29.9

Table 3. TABLE III: Quantitative Comparison with SOTA Methods on the Human3.6M Dataset Under P-MPJPE for Various Actions. The Best and Second-Best Results Are Highlighted in Bold and Underline Formats.

Method	Type	P-MPJPE $↓$
Method	Type	Dir.	Disc.	Eat	Greet	Phone	Photo	Pose	Pur.	Sit	SitD.	Smoke	Wait	WalkD.	Walk	WalkT.	Avg.
TCN [33]	CNN	34.1	36.1	34.4	37.2	36.4	42.2	34.4	33.6	45.0	52.5	37.4	33.8	37.8	26.6	27.3	36.5
DUE [64]	GCN	30.3	34.6	29.6	31.7	31.6	38.9	31.8	31.9	39.2	42.8	32.1	32.6	31.4	25.1	23.8	32.5
GLA-GCN [58]	GCN	32.4	35.3	32.6	34.2	35.0	42.1	32.1	31.9	45.5	49.5	36.1	32.4	35.6	23.5	24.7	34.8
FTCM [42]	MLP	31.9	35.1	34.0	34.2	36.0	42.1	32.3	31.2	46.6	51.9	36.5	33.8	34.4	23.8	24.9	35.3
PoseFormer [69]	Transformer	34.1	36.1	34.4	37.2	36.4	42.2	34.4	33.6	45.0	52.5	37.4	33.8	37.8	25.6	27.3	36.5
P-STMO [36]	Transformer	31.3	35.2	32.9	33.9	35.4	39.3	32.5	31.5	44.6	48.2	36.3	32.9	34.4	23.8	23.9	34.4
MixSTE [65]	Transformer	30.8	33.1	30.3	31.8	33.1	39.1	31.1	30.5	42.5	44.5	34.0	30.8	32.7	22.1	22.9	32.6
MHFormer [25]	Transformer	31.5	34.9	32.8	33.6	35.3	39.6	32.0	32.2	43.5	48.7	36.4	32.6	34.3	23.9	25.1	34.4
STCFormer [43]	Transformer	29.3	33.0	30.7	30.6	32.7	38.2	29.7	28.8	42.2	45.0	33.3	29.4	31.5	20.9	22.3	31.8
TC-MixSTE [49]	Transformer	29.5	32.8	28.9	31.6	32.8	37.9	29.8	29.1	41.8	44.3	33.5	30.6	32.2	21.2	22.2	31.9
DualFormer [71]	Transformer	31.4	34.9	32.5	34.3	35.1	39.4	33.0	32.0	42.9	48.9	36.2	32.9	33.5	23.7	25.1	34.4
D3DP [37]	Diffusion	27.5	29.4	26.6	27.7	29.2	34.3	27.5	26.2	37.3	39.0	30.3	27.7	28.2	19.6	20.3	28.7
KTPFormer [34]	Diffusion	24.1	26.7	24.2	24.9	27.3	30.6	25.2	23.4	34.1	35.9	28.1	25.3	25.9	17.8	18.8	26.2
HTP (Ours)	Diffusion	21.8	23.4	21.2	21.9	23.8	27.9	21.9	20.8	31.2	31.2	24.6	21.8	23.5	17.3	17.4	23.3

Table 4. TABLE IV: Quantitative Comparison with SOTA Methods on the MPI-INF-3DHP Dataset. The Best and Second-Best Results Are Highlighted in Bold and Underline Formats.

Method	Type	$F$	MPI-INF-3DHP
Method	Type	$F$	PCK $↑$	AUC $↑$	MPJPE $↓$
TCN [33]	CNN	81	86.0	51.9	84.0
GLA-GCN [58]	GCN	81	98.5	79.1	27.8
FTCM [42]	MLP	81	98.0	79.8	31.2
PoseFormer [69]	Transformer	9	88.6	56.4	77.1
P-STMO [36]	Transformer	81	97.9	75.8	32.2
MixSTE [65]	Transformer	27	94.4	66.5	54.9
PoseFormerV2 [67]	Transformer	81	97.9	78.8	27.8
MHFormer [25]	Transformer	9	93.8	63.3	58.0
TC-MixSTE [49]	Transformer	81	98.7	79.5	27.6
DualFormer [71]	Transformer	9	97.8	73.4	40.1
Diffpose [12]	Diffusion	81	98.0	75.9	29.1
D3DP [37]	Diffusion	81	98.0	79.1	28.1
KTPFormer [34]	Diffusion	81	99.0	79.3	29.1
FinePOSE [54]	Diffusion	81	98.9	80.0	26.2
HTP (Ours)	Diffusion	81	99.5	80.5	26.4

Table 5. TABLE V: MACs and Speed Comparison with Diffusion-Based 3D HPE Methods. All Models Are Evaluated Under the Same Setting. Best Results Are Bolded.

Method	MPJPE $↓$	Params (M)	Train MACs/frame	Inference MACs/frame	Inference FPS $↑$
Setting: $K = 1$, $H = 20$
D3DP [37]	38.8	34.8	0.58 G	22.9 G	772.7
FinePose [54]	40.0	200.6	0.60 G	22.9 G	723.7
KTPFormer [34]	39.5	36.3	0.58 G	23.6 G	705.8
HTP (Ours)	32.9	37.5	0.36 G	10.0 G	2277.5
Setting: $K = 10$ , $H = 20$
D3DP [37]	35.4	34.8	0.58 G	228.8 G	79.6
FinePose [54]	31.9	200.6	0.60 G	236.2 G	73.8
KTPFormer [34]	33.0	36.3	0.58 G	228.8 G	73.5
HTP (Ours)	29.9	37.5	0.36 G	99.8 G	137.0
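
The headline percentages quoted in the abstract can be roughly cross-checked against the $K = 10$, $H = 20$ rows of the table above. The following is a minimal sketch that assumes the reported averages are plain means of the per-baseline relative changes; the source does not state how the averages were computed, so that averaging rule is an assumption.

```python
# Values copied from the K = 10, H = 20 block of the table above.
baselines = {
    "D3DP":      {"train": 0.58, "infer": 228.8, "fps": 79.6},
    "FinePose":  {"train": 0.60, "infer": 236.2, "fps": 73.8},
    "KTPFormer": {"train": 0.58, "infer": 228.8, "fps": 73.5},
}
htp = {"train": 0.36, "infer": 99.8, "fps": 137.0}


def mean(values):
    """Arithmetic mean of an iterable (assumed averaging rule)."""
    values = list(values)
    return sum(values) / len(values)


# Relative reduction in MACs and relative gain in FPS, averaged over baselines.
train_cut = mean(1 - htp["train"] / b["train"] for b in baselines.values())
infer_cut = mean(1 - htp["infer"] / b["infer"] for b in baselines.values())
fps_gain = mean(htp["fps"] / b["fps"] - 1 for b in baselines.values())

print(f"training MACs reduction:  {train_cut:.1%}")
print(f"inference MACs reduction: {infer_cut:.1%}")
print(f"FPS improvement:          {fps_gain:.1%}")
```

Under this assumption the three values land close to the 38.5%, 56.8%, and 81.1% figures reported in the paper; small deviations are expected because the table entries are rounded.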

Table 6. TABLE VI: Efficiency Comparison with Non-Diffusion Baselines on Human3.6M.

Setting: $K = 1$ , $H = 1$
Method	Type	MPJPE $↓$	Params (M)	MACs/frame
PoseFormer [69]	Transformer	44.3	9.5	1.62 G
PoseFormerV2 [67]	Transformer	45.2	9.5	2.10 G
STCFormer [43]	Transformer	40.8	18.9	0.32 G
HTP w/ MixSTE	Transformer	39.9	36.4	0.36 G
D3DP [37]	Diffusion	40.0	34.6	1.14 G
HTP (ours) w/ $n_{1} = 3$	Diffusion	39.8	37.5	0.72 G
HTP (ours) w/ $n_{1} = 1$	Diffusion	40.6	37.5	0.50 G

Table 7. TABLE VII: Ablation Study on the Location of MGPTP Module on the Human3.6M Dataset.

$f$	$n_{1}$	MPJPE $↓$	MACs (G)
54	1	33.0	60.6
54	2	32.0	74.2
54	3	29.9	87.7
54	4	30.8	101.2

Table 8. TABLE IX: Adjusting the Temporal Node Number $\eta$ in TCEP During Inference.

$η$	MPJPE	FPS
243	30.3	242.5
162	29.9	244.6
81	34.4	254.3

Table 9. TABLE XI: Ablation Study on Different Designs of HTP.

Setting	TCEP	SFT MHSA	MGPTP	MPJPE $↓$	Param	MACs
Baseline				34.7	35.5	143.375
Setting1	✓			31.9	36.3 $↑_{0.8}$	143.392 $↑_{0.017}$
Setting2	✓	✓		33.3	36.3 $↑_{0.0}$	143.392 $↑_{0.000}$
Setting3	✓		✓	31.6	37.5 $↑_{2.0}$	87.648 $↓_{55.727}$
HTP (Ours)	✓	✓	✓	29.9	37.5 $↑_{2.0}$	87.648 $↓_{55.727}$

Table 10. TABLE XII: Analysis of the Impact of the Sparse Mask $\mathbf{M}$.

Setting	Temporal MHSA w $\mathbf{M}$	Temporal MHSA w/o $\mathbf{M}$	MGPTP w $\mathbf{M}$	MGPTP w/o $\mathbf{M}$	MPJPE $↓$
Setting4		✓		✓	32.7
Setting5		✓	✓		31.6
Setting6	✓			✓	31.3
HTP (Ours)	✓		✓		29.9

Table 11. TABLE XIII: Impact of Input Sequence Length $F$.

$F$	Batch Size	MPJPE $↓$	Inference MACs/frame	Inference FPS $↑$
81	4	34.4	119.2	139.2
81	32	34.4	119.2	140.6
162	4	34.3	119.2	130.2
162	12	34.3	119.2	130.8
243	4	29.9	99.8	137.0

Equations

q(y_{t} \mid y_{0}) := \sqrt{\overline{\alpha}_{t}}\, y_{0} + \sqrt{1 - \overline{\alpha}_{t}}\, \epsilon,

\mathcal{L} = \frac{1}{J} \sum_{k=1}^{J} \lVert y_{0} - \hat{y}_{0} \rVert_{2}.

y_{h,t-1} = \sqrt{\overline{\alpha}_{t-1}} \cdot \hat{y}_{h,0} + \sqrt{1 - \overline{\alpha}_{t-1} - \sigma_{t}^{2}}\, \hat{\epsilon}_{t} + \sigma_{t} \epsilon,

\hat{\epsilon}_{t} = \frac{y_{h,t} - \sqrt{\overline{\alpha}_{t}} \cdot \hat{y}_{h,0}}{\sqrt{1 - \overline{\alpha}_{t}}}, \qquad \sigma_{t} = \sqrt{\frac{1 - \overline{\alpha}_{t-1}}{1 - \overline{\alpha}_{t}}} \cdot \sqrt{1 - \frac{\overline{\alpha}_{t}}{\overline{\alpha}_{t-1}}}.

A_{T} = \frac{(A_{F} + \hat{A}_{F}) + (A_{F} + \hat{A}_{F})^{'}}{2},

S^{(j)} = \frac{Y_{t}^{(j)} (Y_{t}^{(j)})^{\top}}{D} \in \mathbb{R}^{F \times F},

M_{pq}^{(j)} = \begin{cases} 1, & p = q \text{ or } q / p \in \mathrm{Top}_{p / q}(S^{(j)}, \eta), \\ 0, & \text{otherwise.} \end{cases}

\check{S}_{pq}^{(j)} = \begin{cases} S_{pq}^{(j)}, & \text{if } M_{pq}^{(j)} = 1, \\ -\infty, & \text{otherwise,} \end{cases}

\check{A}_{T}^{(j)} = A_{T} \odot \sigma_{1}(\check{S}^{(j)}).

Y_{t}^{'(j)} = Y_{t}^{(j)} + \sigma_{2}(\check{A}_{T}^{(j)} Y_{t}^{(j)} \odot W) \in \mathbb{R}^{F \times D},

M_{pq}^{'(j)} = \begin{cases} 0, & \text{if } M_{pq}^{(j)} = 1, \\ -\infty, & \text{if } M_{pq}^{(j)} = 0, \end{cases}

\mathrm{head}_{i} = \mathrm{softmax}\left(\frac{Q_{i} K_{i}^{T}}{\sqrt{d_{k}}} + M^{'}\right) V_{i},

\tilde{Y}_{t} = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W_{O} + Y_{t}^{'}.

\tilde{Y}_{t}^{'} = \mathrm{MLP}(\mathrm{LN}(\tilde{Y}_{t})) + \tilde{Y}_{t}.

d_{m}(z_{p}, z_{q}) = \begin{cases} \frac{\lVert z_{p} - z_{q} \rVert_{2}}{D}, & \overline{M}_{pq} = 1, \\ \Lambda, & \overline{M}_{pq} = 0, \end{cases}

\mathrm{KNN}_{m}(z_{p}) = \{ z_{q} \in z_{t} \mid d_{m}(z_{p}, z_{q}) \leq d_{m}(z_{p}, \mathrm{NN}_{k}(z_{p})) \}.

\varphi_{p} = \exp\left( - \frac{1}{k} \sum_{z_{q} \in \mathrm{KNN}_{m}(z_{p})} d_{m}(z_{p}, z_{q})^{2} \right).

\tilde{s}_{p} = \begin{cases} s_{p}, & \text{if } s_{p} > 0, \\ -\infty, & \text{otherwise.} \end{cases}

\hat{\varphi}_{p} = \varphi_{p} \cdot \sigma_{1}(\tilde{s}_{p}),

\omega_{p} = \begin{cases} \min_{q : \hat{\varphi}_{q} > \hat{\varphi}_{p}} d_{m}(z_{p}, z_{q}), & \text{if } \exists\, \hat{\varphi}_{q} > \hat{\varphi}_{p}, \\ \max_{q} d_{m}(z_{p}, z_{q}), & \text{otherwise.} \end{cases}

\overline{Y}_{t} = P_{I}(\tilde{Y}_{t}^{'}) \in \mathbb{R}^{J \times f \times D},

Full text

Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning

Yuquan Bi, Hongsong Wang, Xinli Shi, Zhipeng Gui, Jie Gui, and Yuan Yan Tang

Corresponding authors: Jie Gui and Hongsong Wang.

Yuquan Bi is with the School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China.

H. Wang is with the School of Computer Science and Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University, Nanjing 210096, China.

Xinli Shi is with the National Center for Applied Mathematics, Southeast University, Nanjing 211189, China.

Z. Gui is with the School of Remote Sensing and Information Engineering and the Collaborative Innovation Center of Geospatial Technology, Wuhan University, Wuhan 430079, China.

Jie Gui is with the School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China, also with Purple Mountain Laboratories, Nanjing 211111, China, and also with the Engineering Research Center of Blockchain Application, Supervision and Management (Southeast University), Ministry of Education, Nanjing 210000, China.

Yuan Yan Tang is with the Department of Computer and Information Science, University of Macau, Macau, China, and also with the Faculty of Science and Technology, UOW College Hong Kong, Hong Kong, China.

Abstract

Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an efficient diffusion-based 3D human pose estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.

I Introduction

3D human pose estimation (HPE) from monocular videos is a fundamental task that has advanced rapidly in recent years, driven by its applications in action recognition [55, 47, 23, 45, 63], human-robot interaction [73, 41, 46], and virtual reality [13, 56, 51]. Benefiting from the excellent performance of 2D pose detectors [4, 32, 40, 11, 48, 27], the 2D-to-3D lifting pipeline [1, 17, 57, 62, 64, 53, 3, 16] has become dominant due to its high precision and lightweight nature.

The 2D-to-3D lifting pipeline lacks depth priors and suffers from ambiguity. To mitigate this issue, recent works incorporate temporal correlations across video frames into the pose reconstruction process. For example, many transformer-based architectures [2, 69, 67, 43, 72, 30, 65, 36, 49, 70, 71] effectively capture long-range temporal dependencies by encoding the joint-level semantics of each video frame into pose tokens, achieving promising performance even on extremely long video sequences. However, the computational cost for spatial-temporal modeling in self-attention (SA) increases quadratically as the number of frames increases, resulting in substantial computational overhead.

Diffusion-based 3D HPE leverages transformer architectures to resolve depth ambiguity through the iterative refinement of high-fidelity 3D pose generation. These methods employ a transformer-based diffusion process, which requires $K$ steps of iterative refinement to generate $H$ pose hypotheses during inference. However, the inherent computational complexity of diffusion models, combined with the transformer-based SA mechanisms, leads to significant resource demands. For instance, processing a 243-frame video sequence with D3DP [37] requires 1.15G MACs per frame during training, but this increases to 228.8G per frame during inference with $H=20$ and $K=10$ . Although adjusting $H$ and $K$ offers a theoretical trade-off between accuracy and efficiency, the combined cost of diffusion steps and transformer operations makes it challenging to achieve both simultaneously.
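
The jump from the training figure to the inference figure is what one would expect if every hypothesis at every denoising step costs one full denoiser pass. The following is a minimal sketch of that arithmetic, assuming cost grows linearly in $H \times K$; the linear cost model and the function name are illustrative assumptions, not something the paper states explicitly.

```python
def inference_macs_per_frame(per_pass_gmacs: float, num_hypotheses: int, num_steps: int) -> float:
    """Assumed cost model: one denoiser pass per hypothesis per denoising step."""
    return per_pass_gmacs * num_hypotheses * num_steps


d3dp_per_pass = 1.15  # G MACs per frame for a single pass, as quoted in the text
estimate = inference_macs_per_frame(d3dp_per_pass, num_hypotheses=20, num_steps=10)
print(f"estimated inference cost: {estimate:.1f} G MACs per frame")
```

The estimate of about 230 G is within 1% of the reported 228.8 G, which is consistent with the linear cost model.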

A straightforward approach to reduce the primary computational bottleneck in 3D HPE is to eliminate redundant pose tokens in the temporal SA calculations. Existing methods typically adopt two disjoint strategies: (1) Frame-level pruning that discards adjacent frames under static redundancy assumptions, and (2) Semantic-level sparsification that clusters low-information tokens via heuristic criteria. While these single-stage strategies effectively reduce computation, they often overlook subtle yet crucial motion transitions. More importantly, such approaches are not well-suited for diffusion-based 3D HPE, where pose reconstruction unfolds iteratively across multiple noise levels. Simply applying naive pruning risks discarding informative content at intermediate steps, thereby compromising motion continuity and stability. To address these challenges, we propose an efficient diffusion-based 3D human pose estimation framework with Hierarchical Temporal Pruning (HTP), which operates across both frame-level and semantic-level stages to preserve essential motion dynamics throughout the denoising process. By selectively retaining key frames and salient pose tokens at each denoising iteration, HTP maintains the integrity of global motion patterns while reducing computational cost.

Specifically, we implement HTP through a structured, hierarchical pruning framework across both frame and semantic levels. First, the Temporal Correlation-Enhanced Pruning (TCEP) module analyzes temporal correlations across video frames. Each node represents a video frame, and we compute a dense correlation matrix to measure inter-frame similarity. Based on this, our Correlation-Enhanced Node Selection Algorithm constructs a dynamic temporal graph and selects a subset of nodes with strong temporal relevance as representative frames. A Sparse Binary Mask $\mathbf{M}$ is generated to store the retained temporal relationships. Second, based on the temporal correlations identified by TCEP, the Sparse-Focused Temporal Multi-Head Self-Attention (SFT MHSA) uses $\mathbf{M}$ to guide attention toward motion-relevant pose tokens. By restricting attention to key frames, SFT MHSA reduces computational overhead while preserving the model’s ability to capture global temporal dependencies. Finally, we apply the Mask-Guided Pose Token Pruner (MGPTP), which integrates frame-level correlations from TCEP and sparse pose token representations from SFT MHSA. MGPTP discards redundant pose tokens while preserving tokens critical to motion fidelity using a density-aware strategy guided by the sparse mask $\mathbf{M}$ . Together, these modules form a cohesive hierarchical denoising framework that enhances computational efficiency and preserves motion fidelity in diffusion-based 3D HPE. As shown in Fig. 1, HTP reduces training MACs by an average of $38.5\%$ , setting a new standard in efficient diffusion-based 3D HPE. The contributions of this paper are as follows:

• We propose Hierarchical Temporal Pruning (HTP), a unified hierarchical pruning framework integrated into diffusion-based 3D HPE that reduces both frame- and token-level redundancy to improve efficiency, overcoming the limitations of previous single-stage strategies.

• TCEP, SFT MHSA, and MGPTP operate under a unified sparse constraint $\mathbf{M}$ to collaboratively reduce temporal redundancy and preserve motion-critical dynamics. All modules are plug-and-play and compatible with both diffusion- and transformer-based 3D HPE pipelines.

• Extensive experiments on Human3.6M and MPI-INF-3DHP show that HTP achieves state-of-the-art accuracy while reducing training MACs by 38.5%, inference MACs by 56.8%, and boosting FPS by 81.1% on average.

II Related Work

II-A Transformer-Based 3D HPE

The Transformer, first proposed by Vaswani et al. [44], has demonstrated remarkable performance in computer vision (CV) tasks [24, 8, 50, 21, 59], as the self-attention mechanism has a strong ability to capture long-range dependencies. This characteristic makes it particularly well-suited for 3D HPE. PoseFormer [69] was the first to adopt the vision transformer as a backbone network for video-based 3D HPE. MixSTE [65] alternates between spatial and temporal transformer blocks to capture spatio-temporal features, providing 3D pose estimates for each frame in the input sequence. DualFormer [71] further enhances performance via dual-path attention across joints and frames. GKONet [16] incorporates geometric priors into a graph-guided transformer for structure-aware prediction. STCFormer [43] reduces computational complexity by separately modeling spatial and temporal components of input joints. MotionBERT [72] introduces a dual-stream spatial-temporal transformer to model long-range spatial-temporal relationships, and is finetuned for skeletal joint-based tasks. However, the quadratic complexity of self-attention in spatial-temporal modeling results in substantial computational overhead.

II-B Diffusion-Based 3D HPE

Diffusion models are a class of generative models that progressively degrade observed data by adding noise, and then restore the original data through a reverse denoising process. These models have demonstrated promising results across various applications, such as image/video generation [22, 35, 52], super-resolution [5, 35], and human motion generation [60, 66, 20, 7]. Recently, several 3D HPE methods based on diffusion models [12, 37, 38, 54, 34] have been proposed to generate high-fidelity 3D human poses, aiming to address the challenge of intrinsic depth ambiguity. D3DP [37, 38] integrates a denoiser based on MixSTE [65] to reconstruct noisy 3D poses by assembling joint-by-joint multiple hypotheses. FinePose [54] learns modifiers for different human body parts to describe human movements at multiple levels of granularity. KTPFormer [34] incorporates the anatomical structure of the human body and joint motion trajectories across frames as prior knowledge to learn spatial and temporal correlations. However, the reliance on iterative refinement ( $K$ steps) and multiple hypotheses ( $H$ samples) significantly increases the computational burden, posing greater efficiency challenges compared to transformer-based counterparts. Distinct from prior diffusion-based approaches [37, 54, 34] that primarily focus on enhancing generation fidelity, HTP is explicitly designed to mitigate this computational bottleneck. By introducing a hierarchical temporal pruning strategy, it effectively optimizes the trade-off between efficiency and performance.

II-C Improving Efficiency in 3D HPE

Enhancing computational efficiency in 3D human pose estimation is crucial for real-world applications, especially in resource-limited environments. While prior works have explored pruning strategies to improve temporal modeling efficiency, most adopt a single-level approach. Frame-level methods, such as DeciWatch [61], G-SFormer [6], and Uplift [10], reduce computational cost by sampling or scheduling representative frames based on temporal redundancy. Token-level methods, including HOT [26] and P-STMO [36], focus on semantic sparsification by removing low-saliency tokens or clustering similar features. While these single-strategy approaches help reduce complexity, they often fail to account for interactions between temporal structure and semantic content. This can lead to suboptimal retention of motion-critical information, particularly in dynamic scenes or under iterative refinement processes like diffusion. In contrast, our approach integrates pruning at both frame and semantic levels within a unified framework. This hierarchical design enables more informed token selection and better preserves motion coherence throughout the denoising process, providing a more robust and adaptive solution than methods relying on a single strategy.

III Preliminary of Diffusion-Based 3D HPE

Diffusion models are generative frameworks that characterize data distributions via a time-dependent diffusion process with two phases: (1) a training phase, where data is progressively perturbed by adding noise, and a denoiser is trained to reverse this perturbation, and (2) an inference phase, where the trained denoiser reconstructs the original, uncorrupted data.

III-A Training

Starting from the ground truth input 3D pose $\bm{y}_{0}$ , a sequence of noisy samples $\{\bm{y}_{t}\}_{t=1}^{T}$ is generated, where $T$ denotes the total number of timesteps. During this process, standard Gaussian noise is introduced to $\bm{y}_{0}$ , progressively transforming it into a Gaussian distribution $\bm{y}_{T}\sim p_{T}$ . Following DDPMs [14], the perturbation of $\bm{y}_{t}$ can be expressed as:

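$$q(\bm{y}_{t}\mid\bm{y}_{0}) := \sqrt{\overline{\alpha}_{t}}\,\bm{y}_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\epsilon,$$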

with $\alpha_{t}=1-\beta_{t}$ , $\overline{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$ , and $\epsilon\sim\mathcal{N}(0,\bm{I})$ , where $\{\beta_{t}\}_{t=1}^{T}$ denotes the variance schedule.
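
The forward perturbation can be sketched in a few lines. This is a minimal illustration under stated assumptions: the linear variance schedule and its endpoints are hypothetical (the text does not say which schedule is used), and the helper names `make_alpha_bar` and `q_sample` are introduced here for illustration only.

```python
import math
import random


def make_alpha_bar(betas):
    """Cumulative products alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(y0, t, alpha_bar, rng):
    """Draw y_t ~ q(y_t | y_0) for a flat list of coordinates; t is 1-indexed."""
    a = alpha_bar[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in y0]


# Hypothetical linear schedule (assumption, not taken from the paper).
T = 1000
betas = [1e-4 + (2e-2 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = make_alpha_bar(betas)

rng = random.Random(0)
y0 = [0.1, -0.3, 0.5]                       # toy 3D coordinates of a single joint
y_early = q_sample(y0, 1, alpha_bar, rng)   # t = 1: signal dominates
y_late = q_sample(y0, T, alpha_bar, rng)    # t = T: close to pure Gaussian noise
```

Because $\overline{\alpha}_{t}$ decreases monotonically toward zero, the signal term shrinks while the noise term grows, which is what turns $\bm{y}_{0}$ into an approximately Gaussian $\bm{y}_{T}$.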

Subsequently, $\bm{y}_{t}$ is passed to the Denoiser conditioned on 2D keypoints $\bm{x}$ and timestep $t$ to reconstruct the original 3D pose $\bm{\hat{y}}_{0}$ . The entire framework is optimized through the standard MPJPE (Mean Per Joint Position Error) loss, which minimizes the average Euclidean distance between the predicted and ground-truth 3D joint positions:

$$\mathcal{L} = \frac{1}{J}\sum_{k=1}^{J}\lVert\bm{y}_{0}-\hat{\bm{y}}_{0}\rVert_{2}.$$
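
The MPJPE loss described above can be sketched for a single frame as follows. This is a toy illustration with made-up coordinates; a real implementation would average over frames and batches as well.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance over joints; inputs are lists of (x, y, z) tuples."""
    assert len(pred) == len(target)
    return sum(math.dist(p, g) for p, g in zip(pred, target)) / len(pred)


target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
error = mpjpe(pred, target)  # per-joint errors are 0 and 5, so the mean is 2.5
```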

III-B Inference

During inference, the reverse diffusion process is applied to recover the original 3D pose $\bm{y}_{0}$ by iteratively denoising the noisy sample $\bm{y}_{T}\sim p_{T}$ . Following D3DP [37], we adopt the multi-hypothesis strategy within the DDIM [39] framework, focusing on predicting the original input rather than the noise. The reverse process can be articulated as follows:

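$$\bm{y}_{h,t-1} = \sqrt{\overline{\alpha}_{t-1}}\cdot\hat{\bm{y}}_{h,0}+\sqrt{1-\overline{\alpha}_{t-1}-\sigma_{t}^{2}}\,\hat{\epsilon}_{t}+\sigma_{t}\epsilon,$$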

where $t$ and $t-1$ are adjacent timesteps in the subset $\tau\subset\{1,\ldots,T\}$ , $h\in\{0,\ldots,H\}$ indexes the hypotheses, $\epsilon\sim\mathcal{N}(0,\bm{I})$ , and

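$$\hat{\epsilon}_{t} = \frac{\bm{y}_{h,t}-\sqrt{\overline{\alpha}_{t}}\cdot\hat{\bm{y}}_{h,0}}{\sqrt{1-\overline{\alpha}_{t}}},\qquad \sigma_{t} = \sqrt{\frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_{t}}}\cdot\sqrt{1-\frac{\overline{\alpha}_{t}}{\overline{\alpha}_{t-1}}}.$$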

Inference begins by sampling $H$ initial 3D poses $\bm{y}_{h,T}$ from a unit Gaussian; the samples are then fed into the denoiser to produce $H$ viable 3D pose hypotheses $\bm{y}_{h,0}$ . This process is repeated iteratively for $K$ steps, with the timestep $t$ updated as $T\left(1-\frac{k}{K}\right)$ at each iteration $k\in[1,K]$ .
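
The sampling loop can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the cosine-style `alpha_bar` schedule is hypothetical, the `denoiser` callable stands in for the trained network (here a toy function that always returns the same pose), and the update uses the $\sigma_{t}$ expression from the reverse-process equations as listed.

```python
import math
import random


def alpha_bar(t, T):
    """Hypothetical cosine-style schedule (assumption): 1 at t = 0, about 0 at t = T."""
    return max(math.cos(0.5 * math.pi * t / T) ** 2, 1e-8)


def ddim_sample(denoiser, dim, H=4, K=10, T=1000, seed=0):
    rng = random.Random(seed)
    # H hypotheses start from unit Gaussian noise at t = T.
    hyps = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(H)]
    t = T
    for k in range(1, K + 1):
        t_prev = T * (1 - k / K)  # timestep schedule from the text
        ab_t, ab_prev = alpha_bar(t, T), alpha_bar(t_prev, T)
        sigma = math.sqrt((1 - ab_prev) / (1 - ab_t)) * math.sqrt(1 - ab_t / ab_prev)
        keep = math.sqrt(max(1 - ab_prev - sigma ** 2, 0.0))  # clamp rounding error
        for h in range(H):
            y0_hat = denoiser(hyps[h], t)  # the network predicts the clean pose
            eps_hat = [(y - math.sqrt(ab_t) * y0) / math.sqrt(1 - ab_t)
                       for y, y0 in zip(hyps[h], y0_hat)]
            hyps[h] = [math.sqrt(ab_prev) * y0 + keep * e + sigma * rng.gauss(0.0, 1.0)
                       for y0, e in zip(y0_hat, eps_hat)]
        t = t_prev
    return hyps


toy_pose = [0.2, -0.1, 0.4]
hypotheses = ddim_sample(lambda y_t, t: toy_pose, dim=3)  # toy denoiser (assumption)
```

At the last iteration $t-1 = 0$, so $\overline{\alpha}_{t-1} = 1$ and both noise terms vanish; each hypothesis then equals the denoiser's final prediction. The total number of denoiser calls is $H \times K$, which is the source of the inference cost discussed in the introduction.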

In line with other diffusion-based methods [37, 54, 34], we employ the Joint-Wise Reprojection-Based Multi-Hypothesis Aggregation (JPMA) technique [37] to aggregate and evaluate the final 3D pose predictions.

IV Method

IV-A Overview

Following the formulation in Sec. III, our framework estimates the clean 3D pose $\bm{y}_{0}$ from its noisy observation $\bm{y}_{t}$ conditioned on 2D keypoints $\bm{x}$ . As illustrated in Fig. 2, the input pair is first projected into a high-dimensional representation and processed by the Spatial GCN [30] and the Spatial MHSA to encode skeletal topology. The output $\bm{Y}_{t}$ then undergoes our core Hierarchical Temporal Pruning (HTP) strategy, which operates in a coarse-to-fine manner:

Frame-Level Pruning: This phase focuses on filtering redundancy while maintaining full temporal resolution ( $F$ ). First, the Temporal Correlation-Enhanced Pruning (TCEP) module initiates the hierarchy by establishing a sparse topology to filter static frames. Subsequently, the diffusion timestep embedding $\mathcal{F}(t)$ is injected into the feature map. The Sparse-Focused Temporal MHSA (SFT MHSA) performs transitional refinement: it models long-range dependencies strictly within this sparse structure. By enhancing the feature discriminability of the retained frames, it acts as a semantic bridge, preparing the representation for the subsequent hard pruning.

Semantic-Level Pruning: Advancing the hierarchy, the Mask-Guided Pose Token Pruner (MGPTP) executes "hard pruning" by physically compressing the sequence length from $F$ to $f$. It aggregates the refined tokens from SFT MHSA into high-level descriptors, achieving deep semantic abstraction.

The condensed sequence $\bar{\bm{Y}}_{t}$ is then processed by $n-n_{1}$ standard encoder blocks for deep refinement. Finally, a Cross MHSA restores the full temporal resolution (from $f$ back to $F$ ) for the final prediction $\hat{\bm{y}}_{0}$ . During inference, we follow the reverse process detailed in Sec. III, employing a deterministic DDIM [39] sampler to recover 3D poses from pure Gaussian noise. Specifically, we iteratively refine $H$ initial hypotheses over $K$ steps and subsequently aggregate these diverse predictions to ensure robust reconstruction.

IV-B Temporal Correlation-Enhanced Pruning (TCEP)

Learning fine-grained temporal correlations is crucial for accurate 3D human pose estimation, particularly in diffusion-based frameworks where iterative denoising must preserve subtle motion cues. However, constructing dense temporal connections for each pose token risks computational inefficiency and semantic drift. To address this within the Frame-level Pruning phase, we design the TCEP module to establish the structural foundation of our hierarchy. By explicitly modeling sparse temporal dependencies, TCEP constructs a preliminary sparse topology by dynamically selecting semantically correlated frames for each joint while filtering out irrelevant connections, thereby ensuring that subsequent feature learning is strictly focused on motion-critical regions.

Given the input $\bm{Y}_{t}$ , TCEP first overlays the learnable global temporal topology matrix $\mathbf{\hat{A}}_{F}\in\mathbb{R}^{F\times F}$ onto the original adjacency matrix $\mathbf{A}_{F}\in\mathbb{R}^{F\times F}$ , as outlined by [30, 34]. This combination can be expressed as:

[TABLE]

where ′ denotes the matrix transpose to ensure symmetry. As illustrated in Fig. 2 top left, we apply a Correlation-Enhanced Node Selecting algorithm to dynamically select temporal correlation nodes and refine their connections. For each joint $j\in\{1,\ldots,J\}$ , $\bm{Y}_{t}^{(j)}\in\mathbb{R}^{F\times D}$ denotes its temporal token sequence. We compute a scaled similarity matrix:

[TABLE]

where each entry $\mathbf{S}^{(j)}_{pq}$ quantifies the correlation between frames $p$ and $q$.

To sparsify the topology, we suppress the diagonal (i.e., self-attention) and retain only the top- $\eta$ highest-scoring off-diagonal neighbors for each row $p$ , forming the directed neighborhood set $\mathrm{Top}_{p}(\mathbf{S}^{(j)},\eta)\subset\{1,\ldots,F\}$ . These selections are then used to construct a binary mask $\mathbf{M}^{(j)}\in\{0,1\}^{F\times F}$ by restoring self-loops and applying symmetric completion:

[TABLE]

The detailed procedure for generating this joint-wise symmetric mask is summarized in Algorithm 1 Phase 1. To enforce sparse support, we update the similarity matrix by setting non-selected entries to a large negative constant, forming the masked version $\check{\mathbf{S}}^{(j)}$ that satisfies:

[TABLE]

This masked matrix ensures that only the top-$\eta$ correlations are retained during the subsequent softmax normalization.
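The mask construction described above can be sketched for a single joint as follows. This is only an illustrative sketch under stated assumptions: the similarity is taken to be a dot product scaled by $\sqrt{D}$, the "large negative constant" is set to $-10^{9}$, and the helper names `similarity`, `topk_symmetric_mask`, and `mask_similarity` are hypothetical.

```python
import math

NEG = -1e9  # assumed stand-in for the "large negative constant"

def similarity(tokens):
    """Scaled dot-product similarity between all frame pairs of one joint."""
    d = len(tokens[0])
    return [[sum(a * b for a, b in zip(p, q)) / math.sqrt(d) for q in tokens]
            for p in tokens]

def topk_symmetric_mask(S, eta):
    """Binary F x F mask: top-eta off-diagonal neighbours per row,
    with self-loops restored and symmetric completion applied."""
    F = len(S)
    M = [[0] * F for _ in range(F)]
    for p in range(F):
        # Rank off-diagonal frames by similarity to frame p.
        others = sorted((q for q in range(F) if q != p),
                        key=lambda q: S[p][q], reverse=True)
        for q in others[:eta]:
            M[p][q] = 1
            M[q][p] = 1  # symmetric completion
        M[p][p] = 1      # restore the self-loop
    return M

def mask_similarity(S, M):
    """Keep selected entries; set all others to the large negative constant."""
    return [[s if m else NEG for s, m in zip(rs, rm)] for rs, rm in zip(S, M)]

# Four frames forming two similar pairs; each frame keeps one neighbour.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
S = similarity(tokens)
M = topk_symmetric_mask(S, eta=1)
S_masked = mask_similarity(S, M)
```

Because of the symmetric completion, a row may end up with more than $\eta$ selected neighbours when other frames choose it, which is why the mask is described as a neighbourhood set per row rather than a fixed-size selection.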

The masked similarity $\check{\mathbf{S}}^{(j)}$ is normalized using the softmax function, and fused with the global adjacency $\mathbf{A}_{T}$ via Hadamard product to yield a joint-specific attention matrix:

[TABLE]

Finally, the pruned temporal graph is used to refine joint-wise tokens through linear projection and nonlinear activation with residual connection:

[TABLE]

where $\mathbf{W}\in\mathbb{R}^{D\times D}$ is a shared linear layer, and $\sigma_{2}(\cdot)$ is the GELU activation function.

We then stack $\{\bm{Y}_{t}^{\prime(j)}\}_{j=1}^{J}$ and $\{\mathbf{M}^{(j)}\}_{j=1}^{J}$ to reconstruct $\bm{Y}^{\prime}_{t}\in\mathbb{R}^{J\times F\times D}$ and the final temporal mask $\mathbf{M}\in\{0,1\}^{J\times F\times F}$ for downstream modules.

IV-C Sparse-Focused Temporal MHSA (SFT MHSA)

Serving as an intermediate refinement within the Frame-level Pruning phase, we introduce the SFT MHSA module. Guided by the sparse temporal mask $\mathbf{M}$ generated by TCEP, it restricts attention computations to semantically correlated frames. Importantly, the module functions as a semantic bridge, enhancing the distinctiveness of the selected frames and ensuring that the tokens are semantically robust before undergoing physical compression in the next phase.

The architecture of the SFT MHSA module is illustrated in Fig. 2 (b). Given the temporally correlated pose token sequence $\bm{Y}^{\prime}_{t}\in\mathbb{R}^{J\times F\times D}$ , we first apply a LayerNorm and project the features into multi-head attention components: query ( $\bm{Q}$ ), key ( $\bm{K}$ ), and value ( $\bm{V}$ ). These are obtained via linear projections using learnable weight matrices $\bm{W}_{Q}$ , $\bm{W}_{K}$ , and $\bm{W}_{V}\in\mathbb{R}^{D\times D}$ , respectively. The multi-head self-attention computation is then performed in parallel across $h$ attention heads, each with dimension $d_{k}=D/h$ . As shown in Algorithm 1 Phase 2, to enforce sparsity, we define an additive mask $\mathbf{M}^{\prime}\in\mathbb{R}^{J\times F\times F}$ constructed by applying the following rule to each joint-specific binary mask $\mathbf{M}^{(j)}\in\{0,1\}^{F\times F}$ :

[TABLE]

where $p,q\in\{1,\ldots,F\}$ . These joint-level masks are stacked to obtain $\mathbf{M}^{\prime}$ , ensuring that only relevant temporal connections are preserved during attention computation, while others are suppressed by assigning near-zero weights via softmax. The attention for each head is then computed as:

[TABLE]

where $\bm{Q}_{i},\bm{K}_{i},\bm{V}_{i}\in\mathbb{R}^{J\times F\times d_{k}}$ are the $i$ -th head-specific projections. The outputs of all heads are concatenated and projected via $\bm{W}_{O}\in\mathbb{R}^{D\times D}$ , followed by a residual connection:

[TABLE]

The result is further processed through a LayerNorm-MLP block with GELU activation, and finalized with a second residual connection:

[TABLE]
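To make the masking rule concrete, the sketch below applies an additive mask inside single-head attention for one joint. It is a minimal illustration, not the authors' code: it assumes the additive mask is $0$ for retained pairs and a large negative constant ($-10^{9}$) otherwise, it omits LayerNorm, the projections $\bm{W}_{Q}$, $\bm{W}_{K}$, $\bm{W}_{V}$, $\bm{W}_{O}$, and the residual connections, and the function names are hypothetical. It shows the masking semantics only; a dense loop like this does not by itself reduce computation.

```python
import math

NEG = -1e9  # assumed large negative constant for suppressed pairs

def additive_mask(M):
    """0 where the binary mask keeps a pair, a large negative value elsewhere."""
    return [[0.0 if m else NEG for m in row] for row in M]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    mx = max(row)
    e = [math.exp(v - mx) for v in row]
    s = sum(e)
    return [v / s for v in e]

def masked_attention(Q, K, V, M):
    """Single-head attention for one joint: softmax(Q K^T / sqrt(d_k) + M') V."""
    d_k = len(Q[0])
    Mp = additive_mask(M)
    out, weights = [], []
    for p, q_vec in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(d_k) + Mp[p][q]
                  for q, k_vec in enumerate(K)]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w[q] * V[q][c] for q in range(len(V)))
                    for c in range(len(V[0]))])
    return out, weights

# Three frames; frame 2 is only connected to itself.
M = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = masked_attention(X, X, X, M)
```

In this toy case, the suppressed pairs receive zero attention weight after the softmax, so the isolated frame simply reproduces its own value vector.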

IV-D Mask-Guided Pose Token Pruner (MGPTP)

Advancing to the Semantic-level Pruning phase, we propose the MGPTP module to transform the refined frame-level features into a compact semantic representation. Operating on the temporally refined tokens from SFT MHSA, MGPTP dynamically selects a compact subset of semantically informative frames by leveraging the learned attention masks. This mechanism aggregates temporal context into high-level semantic descriptors, maximizing computational efficiency while preserving essential motion cues.

As shown in Fig. 3, given the pose tokens after SFT MHSA, $\bm{\tilde{Y}}^{\prime}_{t}\in\mathbb{R}^{J\times F\times D}$ , we apply average pooling along the joint dimension to obtain frame-wise tokens $\mathbf{z}_{t}\in\mathbb{R}^{F\times D}$ . Simultaneously, the joint-wise temporal mask $\mathbf{M}\in\{0,1\}^{J\times F\times F}$ is aggregated across joints via the same pooling operation, producing a smoothed frame-wise attention mask $\overline{\mathbf{M}}\in\mathbb{R}^{F\times F}$ . This unified step ensures both the token representation and its corresponding temporal guidance share consistent semantic granularity across frames.
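The joint-wise pooling step can be sketched as below. This is a minimal illustration using nested lists for a $J\times F\times D$ token tensor and a $J\times F\times F$ mask; the function name `pool_over_joints` is hypothetical.

```python
def pool_over_joints(Y, M):
    """Average tokens (J x F x D) and masks (J x F x F) over the joint axis."""
    J = len(Y)
    F, D = len(Y[0]), len(Y[0][0])
    # Frame-wise tokens z_t of shape F x D.
    z = [[sum(Y[j][p][c] for j in range(J)) / J for c in range(D)]
         for p in range(F)]
    # Smoothed frame-wise mask of shape F x F with entries in [0, 1].
    M_bar = [[sum(M[j][p][q] for j in range(J)) / J for q in range(F)]
             for p in range(F)]
    return z, M_bar

# Two joints, two frames, one feature channel.
Y = [[[1.0], [3.0]], [[3.0], [5.0]]]
M = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
z, M_bar = pool_over_joints(Y, M)
```

An entry of the pooled mask equals the fraction of joints that keep the corresponding frame pair, which is why the result is real-valued rather than binary.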

Then, a novel mask-guided density peaks clustering based on the $k$-nearest neighbors algorithm is employed to cluster pose tokens with high motion relevance, guided by $\overline{\mathbf{M}}$. Specifically, for any pair of frame tokens $\mathbf{z}_{p},\mathbf{z}_{q}\in\mathbf{z}_{t}$, the mask-guided Euclidean distance is defined as:

[TABLE]

where $\Lambda=\max_{p,q}\|\mathbf{z}_{p}-\mathbf{z}_{q}\|_{2}/\sqrt{D}+\varepsilon$ . $\varepsilon>0$ is a small constant ensuring that pairs masked out by $\overline{\mathbf{M}}$ are strictly farther than any valid pair, effectively excluding them during neighborhood search.
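A possible reading of the mask-guided distance is sketched below. The criterion used to decide that a pair is masked out (a pooled mask entry equal to zero) is an assumption, as are the function name `mask_guided_distance` and the default value of $\varepsilon$.

```python
import math

def mask_guided_distance(z, M_bar, eps=1e-6):
    """Pairwise distances; pairs without mask support are pushed to Lambda."""
    F, D = len(z), len(z[0])
    d = [[math.dist(z[p], z[q]) / math.sqrt(D) for q in range(F)]
         for p in range(F)]
    # Lambda exceeds every valid distance by eps.
    lam = max(max(row) for row in d) + eps
    # Assumption: a pair counts as valid when its pooled mask entry is positive.
    return [[d[p][q] if M_bar[p][q] > 0 else lam for q in range(F)]
            for p in range(F)]

z = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
M_bar = [[1.0, 0.0, 0.5], [0.0, 1.0, 1.0], [0.5, 1.0, 1.0]]
d_m = mask_guided_distance(z, M_bar)
```

Since $\Lambda$ is strictly larger than any unmasked distance, masked pairs are the last candidates considered in a nearest-neighbour search.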

Let $k$ be a fixed neighborhood size, and $\mathrm{NN}_{k}(\mathbf{z}_{p})$ be the $k$ -th nearest point to $\mathbf{z}_{p}$ according to $\mathrm{d_{m}}$ . Thus, the mask-guided k-nearest neighbors $\mathrm{KNN}_{\mathrm{m}}(\cdot)$ of $\mathbf{z}_{p}$ is defined as:

[TABLE]

The local density $\varphi_{p}$ of frame $p$ is then computed via a Gaussian kernel over $\mathrm{KNN}_{\mathrm{m}}(\mathbf{z}_{p})$:

[TABLE]

To emphasize frames that maintain richer temporal connectivity, we aggregate the mask support per frame, $s_{p}=\sum_{q=1}^{F}\overline{\mathbf{M}}_{pq}$, and apply a stability-aware transformation:

[TABLE]

The mask-guided response density $\hat{\varphi}_{p}$ is then computed by:

[TABLE]

where $\sigma_{1}(\cdot)$ denotes the softmax function applied across all $\tilde{s}_{p}$ . Next, the minimal distance $\omega_{p}$ to higher-density neighbors is computed for each frame:

[TABLE]

Following the clustering principle in [9], the saliency score of each token $\mathbf{z}_{p}$ is defined as $\omega_{p}\times\hat{\varphi}_{p}$ . We then select the top- $f$ cluster centers and obtain an ordered index set $I=\{i_{1}<i_{2}<\cdots<i_{f}\}$ on the temporal axis. Finally, we define an order-preserving temporal selection operator $\mathcal{P}_{I}$ applied to the original joint-wise token sequence, which extracts the temporal slices at positions $I$ uniformly across all joints. The resulting sequence is

[TABLE]

which retains the full spatial and feature dimensions while compressing the temporal axis. This pruned sequence maintains temporal coherence and semantic consistency across all joints, enabling more efficient downstream reasoning.
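The selection procedure can be approximated by the following density-peak sketch, which operates on a precomputed distance matrix. Several details are assumptions rather than the paper's definitions: the Gaussian kernel is taken as the exponential of the negative mean squared distance to the $k$ nearest neighbours, the mask-support term is reduced to an optional per-frame `weight` list (the stability-aware transformation is not reproduced), and the function name `select_frames` is hypothetical.

```python
import math

def select_frames(dist, k, f, weight=None):
    """Pick f frame indices by density-peak saliency, in temporal order."""
    F = len(dist)
    # Local density from the k nearest neighbours (excluding the frame itself).
    phi = []
    for p in range(F):
        nearest = sorted(dist[p][q] for q in range(F) if q != p)[:k]
        phi.append(math.exp(-sum(v * v for v in nearest) / k))
    if weight is not None:
        # Hypothetical hook for the mask-support reweighting of the density.
        phi = [a * w for a, w in zip(phi, weight)]
    # Minimal distance to any frame of strictly higher density;
    # a frame with no denser neighbour takes its maximal distance instead.
    omega = []
    for p in range(F):
        higher = [dist[p][q] for q in range(F) if phi[q] > phi[p]]
        omega.append(min(higher) if higher else max(dist[p]))
    score = [o * d for o, d in zip(omega, phi)]
    top = sorted(range(F), key=lambda p: score[p], reverse=True)[:f]
    return sorted(top)  # order-preserving index set I

def prune(Y, index):
    """Apply the same temporal index set to every joint (J x F x D -> J x f x D)."""
    return [[joint[i] for i in index] for joint in Y]

# Six frames forming two tight groups on a line; one centre is kept per group.
pts = [0.0, 1.0, 2.0, 50.0, 51.0, 52.0]
dist = [[abs(a - b) for b in pts] for a in pts]
I = select_frames(dist, k=2, f=2)
```

In this toy example, the two selected indices are the middle frames of each group, since they combine high local density with a large distance to any denser frame. Sorting the selected indices before slicing is what keeps the pruned sequence in temporal order.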

V Experiments

V-A Datasets and Metrics

1) Human3.6M [19]: Human3.6M is the largest and most widely used indoor benchmark dataset for HPE tasks, comprising 3.6 million RGB images covering 15 activities performed by 11 actors. Videos are recorded at 50 Hz using four synchronized and calibrated cameras. Following [25, 65, 37], our model is trained on 5 subjects (S1, S5, S6, S7, S8) and evaluated on 2 subjects (S9, S11). For evaluation metrics, we report the mean per joint position error (MPJPE) and Procrustes MPJPE (P-MPJPE).

2) MPI-INF-3DHP [31]: MPI-INF-3DHP is a recently popular dataset consisting of indoor and outdoor scenes. The training set contains 8 activities performed by 8 actors, while the test set covers 7 activities. Following the protocol in [36], we use the area under the curve (AUC), percentage of correct keypoints (PCK), and MPJPE as evaluation metrics.
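For reference, MPJPE and PCK can be computed as in the sketch below. It is a simplified illustration for a single pose: the 150 mm PCK threshold is an assumption (it is not stated here), root alignment is assumed to have been applied beforehand, and P-MPJPE is not shown because it additionally requires a rigid Procrustes alignment before the error is measured.

```python
import math

def mpjpe(pred, gt):
    """Mean Euclidean distance (mm) between predicted and ground-truth joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def pck(pred, gt, threshold=150.0):
    """Percentage of joints whose error is within `threshold` (assumed 150 mm)."""
    hits = sum(math.dist(p, g) <= threshold for p, g in zip(pred, gt))
    return 100.0 * hits / len(gt)

# Two joints with errors of 5 mm and 200 mm.
pred = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
gt = [[3.0, 4.0, 0.0], [0.0, 0.0, 200.0]]
print(mpjpe(pred, gt), pck(pred, gt))
```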

V-B Implementation Details

1) Training Details: We use CPN [4] as the 2D keypoint detector to generate the 2D inputs. The number of SFT MHSA layers $n_{1}$ is set to 3. The batch size is set to 4, with each sample containing a pose sequence of 243 frames. For temporal pruning, the pruning length $f$ is selected based on dataset-specific temporal characteristics. For Human3.6M, which contains long and high-frame-rate sequences with substantial temporal redundancy, a pruning length of $f=54$ provides an effective balance between removing redundant frames and retaining sufficient motion granularity. For MPI-INF-3DHP, where sequences are shorter and motion cues are more concentrated, a conservative pruning ratio of 3:1 ($f=27$) is adopted to preserve essential fine-grained motion information. We adopt the AdamW [28] optimizer with momentum parameters $(\beta_{1},\beta_{2})=(0.9,0.999)$ and a weight decay of 0.1. We train our model for 150 epochs with an initial learning rate of $6\times10^{-5}$, which is multiplied by a shrink factor of 0.993 after each epoch. For fair comparisons, we set the number of hypotheses $H=1$ and iterations $K=1$ during training, and $H=20$ and $K=10$ during inference, as in D3DP [37].
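The learning-rate schedule stated above amounts to an exponential decay per epoch. A minimal sketch, with the hypothetical helper `learning_rate`:

```python
def learning_rate(epoch, base_lr=6e-5, shrink=0.993):
    """Learning rate after `epoch` completed epochs of per-epoch decay."""
    return base_lr * shrink ** epoch

# Rate at the start of training and after all 150 epochs.
start, end = learning_rate(0), learning_rate(150)
```

Under this schedule the rate falls to roughly a third of its initial value by the end of training, i.e. about $2.1\times10^{-5}$.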

2) Implementation of Plug-and-Play Integration: To verify generality, we integrated HTP into Transformer-based frameworks. All variants were trained from scratch to ensure fair comparison. For MixSTE [65], the integration follows the identical architectural modifications as our main method. For the dual-stream MotionBERT [72], we inserted TCEP and an initial SFT MHSA layer prior to the backbone. Inside the first $n_{1}=2$ DSTformer blocks, we replaced the original Temporal MHSA components with our SFT MHSA. The MGPTP module was inserted after the second block, allowing the remaining layers to process the efficient, pruned sequence.

3) Implementation of Efficiency Evaluation: We report the total Multiply-Accumulate operations (MACs), representing the cumulative computational cost aggregated across all $K$ denoising iterations and $H$ hypotheses. These values are computed via the THOP library, following the standard convention where 1 MAC $\approx$ 2 FLOPs. Inference speed (FPS) is measured on two NVIDIA GeForce RTX 4090 GPUs with a batch size of 8 under FP32 precision. The FPS is calculated by dividing the total number of processed frames by the total wall-clock time of inference, excluding data loading overhead.
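The accounting described above can be summarized by the following sketch. It assumes that every denoiser pass has the same cost, so the total scales linearly with $H$ and $K$; the helper names and the example numbers are hypothetical and are not measurements from the paper.

```python
def total_macs(macs_per_pass, H, K):
    """Cumulative MACs over H hypotheses and K denoising iterations."""
    # Assumption: each forward pass of the denoiser costs the same.
    return macs_per_pass * H * K

def macs_to_flops(macs):
    """Standard convention: 1 MAC is counted as roughly 2 FLOPs."""
    return 2 * macs

def fps(num_frames, seconds):
    """Frames processed per second of wall-clock inference time."""
    return num_frames / seconds

# Hypothetical example: 8 sequences of 243 frames processed in 2 seconds.
example_fps = fps(8 * 243, 2.0)
example_total = total_macs(1.0, H=20, K=10)
```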

V-C Quantitative Results

1) Human3.6M: Tab. I presents comparisons between our HTP and recent state-of-the-art (SOTA) 3D HPE methods on the Human3.6M dataset. Our approach achieves SOTA performance, with MPJPE of $29.9\text{mm}$ and P-MPJPE of $23.3\text{mm}$ using 2D poses detected by CPN [4] as inputs, and MPJPE of $16.7\text{mm}$ when using ground-truth 2D poses as inputs. With an $81\%$ reduction in parameter count and a $40.0\%$ decrease in computational cost, HTP surpasses the previous SOTA method FinePose [54] by $2.0\text{mm}$ in MPJPE and $1.7\text{mm}$ in P-MPJPE.

Additionally, to demonstrate the plug-and-play capability of our design, we integrate the proposed Hierarchical Temporal Pruning strategy into two Transformer-based frameworks: MixSTE [65] and MotionBERT [72]. On MixSTE, HTP reduces MACs by $37\%$ while improving MPJPE by $1.0\text{mm}$ and P-MPJPE by $0.7\text{mm}$ . On MotionBERT, we observe similar gains, with a $42\%$ reduction in MACs and improved pose accuracy across both metrics. We further compare HTP with emerging Mamba-based methods, PoseMamba-X [18] and SAMA-L [29]. While these state-space models offer competitive efficiency, HTP demonstrates superior reconstruction fidelity, outperforming SAMA-L by 7.0 mm in MPJPE.

Further, Tab. II and Tab. III report per-action results on the Human3.6M dataset under MPJPE and P-MPJPE metrics, using detected 2D keypoints [4] as input. Our HTP framework consistently achieves the lowest error across all 15 action categories, outperforming all prior state-of-the-art methods. Notably, significant improvements are observed in challenging categories such as “SitD”, “Walk”, and “WalkT”, highlighting its ability to handle diverse motion dynamics with precision.

2) MPI-INF-3DHP: Tab. IV presents comparisons between our method and recent SOTA 3D HPE approaches on the MPI-INF-3DHP dataset. Following [37, 54], our model is trained using ground-truth 2D poses as inputs. Compared to recent SOTA works [54, 34], HTP maintains comparable MPJPE while improving PCK by $0.5\%$ and AUC by $0.5\%$ . While the performance gain is less pronounced than in Human3.6M due to the reduced temporal redundancy available for pruning in shorter sequences, these results confirm that HTP remains robust and effective even under constrained temporal resolutions.

3) MACs and Speed: Tab. V presents a comprehensive comparison between HTP and recent diffusion-based 3D HPE methods across multiple sampling configurations ($K{=}1$, $10$) with a fixed number of hypotheses ($H{=}20$). Under the low-sampling regime ($K{=}1$), HTP achieves a substantial $3\times$ acceleration in inference FPS while reducing per-frame MACs by over 56%, along with the lowest MPJPE of 32.9 mm. As $K$ increases, HTP consistently achieves both lower error and lower inference cost compared to all prior methods, demonstrating its robustness to different sampling budgets. Notably, even with $K{=}1$, our method surpasses the accuracy of several baselines operating at $K{=}10$ (e.g., 32.9 mm MPJPE vs. 35.4 mm for D3DP), while reducing MACs from 457.6G to 20.0G and boosting FPS from 142.2 to 2443.9, representing a $16\times$ speedup. HTP thus achieves strong generalization and cost-efficiency across budget settings, enabling real-world deployment.

4) Efficiency Comparison with Non-Diffusion Baselines: We compare our diffusion-based HTP with representative Transformers in Tab. VI. Compared to the seq-to-frame baseline PoseFormer [69] (1.62 G/frame), our HTP ( $n_{1}=1$ ) is drastically more efficient (0.50 G/frame) and accurate (40.6 mm vs. 44.3 mm). Against the efficient seq-to-seq STCFormer [43] (0.32 G/frame), our method achieves higher accuracy (40.6 mm vs. 40.8 mm) at a comparable computational scale. Moreover, our high-fidelity setting ( $n_{1}=3$ ) further reduces the error to 39.8 mm while maintaining an affordable cost (0.72 G/frame), effectively positioning diffusion models as a high-performance competitor to lightweight transformers.

V-D Sensitivity Analysis and Ablation Study

We conduct comprehensive sensitivity and ablation studies on the Human3.6M dataset to validate the design of HTP. Specifically, we examine: (1)–(3) key architectural hyperparameters, including the pruned sequence length $f$ , module placement $n_{1}$ , and neighborhood sizes ( $\eta,k$ ); (4) the flexibility of adjusting parameters ( $\eta,n_{1}$ ) at inference time to balance efficiency; (5)–(6) the incremental contribution of each module and the specific role of the sparse mask $\mathbf{M}$ in guiding attention and pruning; and (7) the impact of input sequence length $F$ on the trade-off between performance and computational cost.

1) Sequence length after MGPTP Module: The number of representative pose tokens retained by the MGPTP module plays a crucial role in performance. As shown in Tab. VIII, increasing the sequence length does not lead to the anticipated performance improvements. This is because retaining too many pose tokens during clustering can amplify the impact of discrete or less informative tokens, which disrupts the coherent understanding of overall motion patterns. Thus, we set $f=54$ in all experiments to achieve an optimal balance.

2) Location of MGPTP: In Tab. VIII, we explore different settings for the number of SFT MHSA layers $n_{1}$ , which effectively determines the placement of the MGPTP module. Adjusting $n_{1}$ allows for a trade-off between computational efficiency and performance. From the results, we observe that lower values of $n_{1}$ lead to reduced performance, as pruning pose tokens too early limits the network’s ability to learn aggregated motion information through SFT MHSA. Conversely, setting $n_{1}$ too high can result in attention collapse, which negatively impacts performance. We find that $n_{1}=3$ provides the best balance between efficiency and accuracy, and thus use this setting for all experiments.

3) Hyperparameters $\eta$ and $k$ : As visualized in Fig. 4, we analyze the sensitivity of $\eta$ (in TCEP) and $k$ (in MGPTP). For $\eta$ , performance peaks at $\eta=162$ ; lower values (e.g., 81) fail to capture sufficient context, while full sequences ( $\eta=243$ ) introduce redundancy that negates pruning benefits. Regarding $k$ , a compact neighborhood ( $k=2$ ) yields the lowest MPJPE (29.9 mm). Increasing $k$ tends to over-smooth local density estimates, obscuring distinct motion states. Consequently, we adopt $\eta=162$ and $k=2$ as the default configuration.

4) Inference Hyperparameters: We further evaluate the impact of adjusting hyperparameters during the inference phase. As shown in Tab. X, for the temporal node number $\eta$, the optimal performance is achieved when the inference value matches the training setting ($\eta=162$). Regarding the number of SFT MHSA layers $n_{1}$ (Tab. X), although matching the training depth ($n_{1}=3$) yields a marginally better MPJPE, it incurs a higher computational cost. We thus adopt $n_{1}=1$ as the default inference configuration for efficiency, achieving a significant FPS boost and MACs reduction with only a negligible performance trade-off compared to $n_{1}=3$.

5) Effect of Each Module: Tab. XI investigates the role of each module through incremental configurations. We begin with a baseline that extends D3DP [37] by adding the Spatial GCN from [30]. In Setting 1, we introduce the TCEP module without applying its generated sparse mask, resulting in a significant MPJPE reduction of 2.8 mm with only a minor increase in parameters (0.8M) and a marginal computational cost of 0.017 G MACs. This highlights TCEP’s effectiveness in capturing essential motion cues by dynamically selecting temporal dependencies. In Setting 2, we further adopt the sparse mask from TCEP to replace full temporal attention with global SFT MHSA. However, this setting performs worse than Setting 1, indicating that overly aggressive sparsification may lead to attention collapse and hinder temporal reasoning. These results suggest that moderate application of sparse self-attention is essential to maintain global context.

In Setting 3, we remove the sparse mask from Setting 2 and instead apply the MGPTP module to prune redundant tokens. By utilizing the sparsity mask generated by TCEP to compress the sequence length, this setting achieves a net reduction of 55.7 G MACs, demonstrating that the efficiency gains from hierarchical pruning vastly outweigh the minimal overhead of the pruning decision modules. Finally, the full HTP configuration combines selective sparse attention with MGPTP, achieving the optimal balance between reconstruction accuracy and computational cost.

6) Impact of the Sparse Mask $\mathbf{M}$ : As shown in Tab. XII, we evaluate the role of the mask $\mathbf{M}$ generated by TCEP. Compared to the baseline without mask guidance (Setting 4), applying $\mathbf{M}$ solely to MGPTP (Setting 5) or SFT MHSA (Setting 6) yields moderate gains. However, the full HTP configuration, which leverages $\mathbf{M}$ in both modules, outperforms Setting 5 and Setting 6 by 1.7mm and 1.4mm, respectively. This demonstrates that sparse attention and mask-guided pruning are mutually reinforcing: selective attention enhances feature discriminability for pruning, while accurate token retention preserves global context. Thus, $\mathbf{M}$ serves as a critical structural bridge, integrating the modules into a cohesive pruning pipeline.

7) Impact of Input Sequence Length $F$ : We analyze the trade-off between sequence length and performance in Tab. XIII. Following prior methods [65, 37, 54, 43], we adopt $F=243$ as the default to maximize long-range temporal context. Empirical results show that reducing $F$ to 162 or 81 degrades MPJPE. While shorter sequences reduce VRAM consumption and allow for larger batch sizes, our tests indicate that larger batch sizes yield minimal gains in inference FPS. Given that HTP’s pruning strategy already ensures high efficiency for 243-frame inputs, we prioritize the accuracy benefits of the longer sequence, recommending shorter versions only for memory-constrained environments.

V-E Qualitative Analysis

1) Qualitative results comparison: Fig. 6 compares HTP with state-of-the-art diffusion-based methods [37, 34, 54] on Human3.6M ( $H=20,K=10$ ). Across actions ranging from simple poses (e.g., “Photo”) to complex articulations (e.g., “Sitting down”, “Smoking”), HTP consistently yields 3D estimations that align closer to the ground truth. Notably, our method demonstrates superior fidelity in limb joints (e.g., elbows and wrists) and maintains better structural plausibility in challenging poses compared to baselines, effectively mitigating the joint distortions observed in D3DP and KTPFormer.

2) In-the-wild Videos: To validate generalization, we evaluate HTP on wild videos using 2D poses detected by HRNet [40]. As shown in Fig. 6, our method exhibits exceptional robustness, maintaining high accuracy even in challenging scenarios with severe self-occlusion and rapid motion.

3) Qualitative Analysis of Frame Retention: Fig. 8 visualizes the adaptive pruning behavior of HTP on “Walking” and “Sitting”. By selecting only 54 representative frames from the 243-frame input, HTP dynamically allocates computational resources based on motion complexity. Specifically, the model retains a higher density of frames during rapid transitions (blue dashed box) to capture fast-moving dynamics, while aggressively pruning frames during stable or slow-motion phases (cyan/green dashed boxes). This content-aware strategy effectively balances temporal sparsity with representational completeness, maintaining accurate 3D pose estimation while significantly reducing computational overhead.

VI Limitations and Future Works

Despite HTP’s substantial efficiency gains, specific challenges present avenues for improvement. First, as shown in Fig. 7, severe self-occlusions can cause the pruning mechanism to inadvertently discard critical frames needed for resolving complex articulations. Second, as a 2D-to-3D lifting framework, HTP’s performance is bounded by 2D input quality, limiting gains in noisy outdoor scenarios. Future work will address these by exploring occlusion-aware attention and spatial uncertainty modeling to enhance robustness.

VII Conclusions

In this paper, we address the efficiency challenges in diffusion-based 3D human pose estimation with Hierarchical Temporal Pruning (HTP). HTP progressively reduces redundancy by selectively pruning pose tokens across both frame and semantic levels while preserving critical motion dynamics. By modeling temporal correlations in TCEP, focusing attention through SFT MHSA, and refining with MGPTP’s semantic-level clustering, HTP retains the most motion-critical pose tokens. Extensive experiments on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HTP achieves state-of-the-art accuracy while substantially reducing computational cost and improving inference speed.

Bibliography


[1] Y. Cai, L. Ge, J. Liu, J. Cai, T. Cham, J. Yuan, and N. M. Thalmann (2019) Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In International Conference on Computer Vision, pp. 2272–2281.
[2] H. Chen, J. He, W. Xiang, Z. Cheng, W. Liu, H. Liu, B. Luo, Y. Geng, and X. Xie (2023) HDFormer: high-order directed transformer for 3D human pose estimation. In International Joint Conference on Artificial Intelligence, pp. 581–589.
[3] T. Chen, C. Fang, X. Shen, Y. Zhu, Z. Chen, and J. Luo (2022) Anatomy-aware 3D human pose estimation with bone-based pose decomposition. IEEE Trans. Circuits Syst. Video Technol. 32 (1), pp. 198–209.
[4] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun (2018) Cascaded pyramid network for multi-person pose estimation. In Conference on Computer Vision and Pattern Recognition, pp. 7103–7112.
[5] H. Chung, B. Sim, and J. C. Ye (2022) Come-closer-diffuse-faster: accelerating conditional diffusion models for inverse problems through stochastic contraction. In Conference on Computer Vision and Pattern Recognition, pp. 12413–12422.
[6] M. Cui, K. Zhang, and Z. Sun (2024) Graph and skipped transformer: exploiting spatial and temporal modeling capacities for efficient 3D human pose estimation. arXiv preprint arXiv:2407.02990.
[7] R. Dabral, M. H. Mughal, V. Golyanik, and C. Theobalt (2023) Mofusion: a framework for denoising-diffusion-based motion synthesis. In Conference on Computer Vision and Pattern Recognition, pp. 9760–9770.
[8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations.