SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting
Ruicong Liu, Yifei Huang, Liangyang Ouyang, Caixin Kang, Yoichi Sato

TL;DR
SFHand is a novel streaming framework that predicts future 3D hand states from continuous video and language instructions, enabling real-time, task-oriented hand forecasting for AR and robotics.
Contribution
The paper introduces SFHand, the first streaming framework for language-guided 3D hand forecasting, together with EgoHaFL, a new large-scale dataset built to support this task.
Findings
SFHand outperforms prior methods by up to 35.8% in 3D hand forecasting.
The learned representations improve downstream manipulation task success rates by up to 13.4%.
The framework effectively incorporates language guidance and temporal context for real-time applications.
Abstract
Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the…
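The streaming, autoregressive idea described in the abstract can be illustrated with a toy sketch. This is not the SFHand architecture: the class `ToyStreamingForecaster`, its parameters `memory_size` and `horizon`, and the constant-velocity predictor are all hypothetical stand-ins for a learned model. The sketch also ignores the language instruction, and the bounded buffer only loosely echoes the role of a memory layer. What it does show is the control flow: frames arrive one at a time, a bounded memory keeps recent context, and each predicted state is fed back to produce the next one.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class HandState:
    """One forecast step, mirroring the outputs listed in the abstract."""
    hand_type: str                               # e.g. "left" or "right"
    bbox_2d: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    pose_3d: List[Vec3]                          # 3D joint positions
    trajectory: Vec3                             # 3D hand position at this step


class ToyStreamingForecaster:
    """Hypothetical stand-in for a streaming forecaster (assumption, not SFHand)."""

    def __init__(self, memory_size: int = 4, horizon: int = 3) -> None:
        # Bounded memory: old observations are dropped automatically,
        # so the cost per frame stays constant as the stream grows.
        self.memory: Deque[Vec3] = deque(maxlen=memory_size)
        self.horizon = horizon

    def _predict_next(self, prev: Vec3, velocity: Vec3, instruction: str) -> HandState:
        # Placeholder for a learned model: constant-velocity extrapolation.
        # The instruction is accepted but unused in this toy version.
        nxt = tuple(p + v for p, v in zip(prev, velocity))
        return HandState("right", (0.0, 0.0, 1.0, 1.0), [nxt], nxt)

    def step(self, wrist: Vec3, instruction: str) -> List[HandState]:
        """Consume one new observation and return `horizon` future states."""
        self.memory.append(wrist)
        if len(self.memory) >= 2:
            a, b = self.memory[-2], self.memory[-1]
            velocity = tuple(y - x for x, y in zip(a, b))
        else:
            velocity = (0.0, 0.0, 0.0)

        forecasts: List[HandState] = []
        prev = wrist
        for _ in range(self.horizon):
            state = self._predict_next(prev, velocity, instruction)
            forecasts.append(state)
            prev = state.trajectory  # autoregressive feedback
        return forecasts


if __name__ == "__main__":
    forecaster = ToyStreamingForecaster(memory_size=2, horizon=2)
    for wrist in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]:
        future = forecaster.step(wrist, "pick up the cup")
        print([s.trajectory for s in future])
```

In a real system the placeholder predictor would be replaced by a learned network conditioned on video features and the instruction text; the loop structure is the only part this sketch is meant to convey.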
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Robot Manipulation and Learning
