Rethinking Graph Convolution for 2D-to-3D Hand Pose Lifting
Chanyoung Kim, Donghyun Kim, Dong-Hyun Sim, Seong Jae Hwang, Youngjoong Kwon

TL;DR
This paper shows that standard multi-head self-attention outperforms graph convolutional networks for 2D-to-3D hand pose lifting, and that hand topology works best as a soft structural prior rather than a fixed adjacency graph.
Contribution
Through controlled, parameter-matched ablations on the FPHA benchmark, the paper shows that self-attention with input-dependent aggregation surpasses GCNs, and that hand topology is best used as a soft positional encoding rather than a fixed adjacency.
Findings
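Against a parameter-matched GCN strengthened with multi-hop adjacency, self-attention reduces MPJPE on FPHA from 12.36 mm to 10.09 mm.
A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of the improvement; fully connected attention yields additional gains.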
Hand topology as a soft positional encoding is more effective than fixed adjacency.
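For readers unfamiliar with the metric in the findings above, MPJPE (mean per-joint position error) is the Euclidean distance between predicted and ground-truth 3D joints, averaged over joints and samples. The sketch below is a minimal NumPy version, not the authors' evaluation code. The function name `mpjpe`, the array shapes, and the toy data are assumptions made for illustration, and any alignment step (for example root-relative alignment) that the paper may apply is not modeled here.

```python
import numpy as np


def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: arrays of shape (num_samples, num_joints, 3),
    both in the same unit (e.g. millimetres).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Euclidean distance per joint, shape (num_samples, num_joints).
    per_joint = np.linalg.norm(pred - target, axis=-1)
    return float(per_joint.mean())


# Toy data, made up for illustration (NOT from the paper): 2 samples, 21 joints.
target = np.zeros((2, 21, 3))
pred = target.copy()
pred[..., 0] += 3.0  # shift every joint by 3 along x
pred[..., 1] += 4.0  # and by 4 along y, so each joint is off by exactly 5
toy_error = mpjpe(pred, target)

# Relative reduction implied by the reported numbers (12.36 mm -> 10.09 mm).
relative_reduction = (12.36 - 10.09) / 12.36
```

With the reported values, the drop from 12.36 mm to 10.09 mm is an absolute reduction of 2.27 mm, or roughly 18% relative to the GCN baseline.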
Abstract
Graph convolutional networks (GCNs) are widely used for 3D hand pose estimation, where the hand skeleton is encoded as a fixed adjacency graph. We revisit whether this is the most effective way to incorporate hand topology in 2D-to-3D lifting. In this paper, we perform controlled, parameter-matched ablations on the FPHA benchmark and show that standard multi-head self-attention consistently outperforms GCN baselines. Even when the GCN is strengthened with multi-hop adjacency and matched parameter count, self-attention reduces MPJPE from 12.36 mm to 10.09 mm. A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of improvement, while fully connected attention yields additional gains. We further show that hand topology is most effective when introduced as a soft structural prior through graph-distance…
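The three aggregation schemes contrasted in the abstract can be sketched side by side: a GCN layer with fixed, skeleton-derived weights; attention restricted to skeleton neighbours; and fully connected attention with topology added as a soft bias. This is a hedged NumPy illustration, not the authors' architecture. The 21-joint layout (wrist plus five 4-joint chains), the single-head formulation, the additive per-hop bias, and all names (`EDGES`, `hop_distances`, `gcn_layer`, `attention`, `bias_per_hop`) are assumptions. The abstract is truncated at "graph-distance", so the exact form of the paper's soft prior is not known from this text.

```python
from collections import deque

import numpy as np

NUM_JOINTS = 21
# Assumed layout: joint 0 is the wrist; each of 5 fingers is a 4-joint chain.
EDGES = []
for finger in range(5):
    chain = [0] + [1 + 4 * finger + k for k in range(4)]
    EDGES += list(zip(chain[:-1], chain[1:]))


def adjacency(num_joints, edges):
    a = np.zeros((num_joints, num_joints))
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a


def hop_distances(a):
    """All-pairs shortest-path hop counts via BFS from every joint."""
    n = a.shape[0]
    dist = np.full((n, n), -1, dtype=int)
    for src in range(n):
        dist[src, src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(a[u]):
                if dist[src, v] < 0:
                    dist[src, v] = dist[src, u] + 1
                    queue.append(v)
    return dist


def gcn_layer(x, a, w):
    """Fixed aggregation: mixing weights depend only on the skeleton, not on x."""
    a_hat = a + np.eye(a.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return a_norm @ x @ w


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(x, wq, wk, wv, bias=None, mask=None):
    """Single-head attention: mixing weights are computed from the input x.

    bias: optional additive term on the logits (soft prior).
    mask: optional boolean array; False entries are excluded (hard constraint).
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    if bias is not None:
        logits = logits + bias
    if mask is not None:
        logits = np.where(mask, logits, -np.inf)
    attn = softmax(logits, axis=-1)
    return attn @ v, attn


rng = np.random.default_rng(0)
dim = 16
x = rng.normal(size=(NUM_JOINTS, dim))  # stand-in joint features
A = adjacency(NUM_JOINTS, EDGES)
D = hop_distances(A)
w, wq, wk, wv = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(4))

# Placeholder bias values, one scalar per hop distance; these would be learned.
bias_per_hop = -0.5 * np.arange(D.max() + 1)

gcn_out = gcn_layer(x, A, w)
masked_out, masked_attn = attention(x, wq, wk, wv, mask=(D <= 1))
soft_out, soft_attn = attention(x, wq, wk, wv, bias=bias_per_hop[D])
```

In this sketch the GCN's mixing matrix is identical for every input, the masked variant recomputes weights per input but only among skeleton neighbours, and the biased variant lets every joint attend to every other joint while the hop-distance bias nudges attention toward nearby joints.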
