Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos
Yilin Wen, Hao Pan, Lei Yang, Jia Pan, Taku Komura, Wenping Wang

TL;DR
This paper introduces a hierarchical transformer framework that leverages temporal cues at different granularities to improve 3D hand pose estimation and action recognition from egocentric RGB videos, addressing occlusion and ambiguity challenges.
Contribution
The paper presents a hierarchical architecture of two cascaded transformer encoders: the first uses short-term temporal cues for hand pose estimation, and the second aggregates per-frame pose and object information over a longer time span for action recognition.
Findings
Achieves competitive results on FPHA and H2O benchmarks.
Demonstrates effectiveness of hierarchical temporal modeling.
Validates design choices through extensive ablation studies.
Abstract
Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address occlusion and ambiguity, we develop a transformer-based framework to exploit temporal information for robust estimation. Noticing the different temporal granularity of and the semantic correlation between hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation, and the latter aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O. Extensive ablation studies verify our design choices.
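The two temporal granularities described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the non-overlapping windowing, the window size, and the averaging placeholders that stand in for the two transformer encoders are all assumptions made for illustration, and every function name here is hypothetical. The sketch only shows the data flow: a short-term stage that sees frames within one window, followed by a long-term stage that sees the per-frame outputs of the whole clip.

```python
from typing import List, Sequence, Tuple


def split_into_windows(frames: Sequence[float], window: int) -> List[List[float]]:
    """Split a frame sequence into consecutive, non-overlapping short windows.

    Assumption: the abstract does not say how short-term windows are formed.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [list(frames[i:i + window]) for i in range(0, len(frames), window)]


def short_term_stage(window_frames: List[float]) -> List[Tuple[float, float]]:
    """Placeholder for the first encoder (hypothetical).

    Emits one token per frame; each token pairs the frame feature with a
    context value computed only from frames in the same short window.
    """
    context = sum(window_frames) / len(window_frames)
    return [(frame, context) for frame in window_frames]


def long_term_stage(tokens: List[Tuple[float, float]]) -> float:
    """Placeholder for the second encoder (hypothetical).

    Aggregates the per-frame tokens of the whole clip into one clip-level
    value, standing in for the action prediction.
    """
    return sum(frame for frame, _ in tokens) / len(tokens)


def hierarchical_pass(frames: Sequence[float], window: int):
    """Run the short-term stage per window, then the long-term stage once."""
    per_frame_tokens: List[Tuple[float, float]] = []
    for window_frames in split_into_windows(frames, window):
        per_frame_tokens.extend(short_term_stage(window_frames))
    return per_frame_tokens, long_term_stage(per_frame_tokens)


if __name__ == "__main__":
    # Six scalar "frame features" processed with a short window of three frames.
    tokens, clip_summary = hierarchical_pass([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], window=3)
    print(tokens)
    print(clip_summary)
```

In this sketch the per-frame tokens keep their frame-level resolution, which is what allows a pose output for every frame, while the clip-level value is produced only once from the full token sequence.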
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Stroke Rehabilitation and Recovery
