SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting

Ruicong Liu; Yifei Huang; Liangyang Ouyang; Caixin Kang; Yoichi Sato

arXiv:2511.18127 · cs.CV · May 18, 2026

TL;DR

SFHand is a novel streaming framework that predicts future 3D hand states from continuous video and language instructions, enabling real-time, task-oriented hand forecasting for AR and robotics.

Contribution

The paper introduces SFHand, which the authors describe as the first streaming framework for language-guided 3D hand forecasting, together with EgoHaFL, a new large-scale dataset built to support this task.

Findings

1. SFHand outperforms prior methods by up to 35.8% in 3D hand forecasting.

2. The learned representations improve downstream manipulation task success rates by up to 13.4%.

3. The framework incorporates language guidance and temporal context for real-time applications.

Abstract

Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the…
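To make the streaming, autoregressive setup described in the abstract more concrete, here is a minimal Python sketch. It is not the authors' architecture: the names `HandState` and `ToyStreamingForecaster` are hypothetical, the input is assumed to be already-detected hand states rather than raw video, the bounded `deque` only stands in for the memory layer, and the learned model is replaced by plain constant-velocity extrapolation. The language instruction is stored but not used by this toy predictor.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class HandState:
    """One hand state; field names are illustrative, not the paper's schema."""
    hand_type: str                               # e.g. "left" or "right"
    bbox_2d: Tuple[float, float, float, float]   # x_min, y_min, x_max, y_max
    pose_3d: List[Vec3]                          # 3D joint positions
    wrist_xyz: Vec3                              # one point on the trajectory


def _add(p: Vec3, d: Vec3) -> Vec3:
    return (p[0] + d[0], p[1] + d[1], p[2] + d[2])


class ToyStreamingForecaster:
    """Hypothetical stand-in: bounded memory plus an autoregressive rollout."""

    def __init__(self, memory_size: int = 4, horizon: int = 3) -> None:
        # Only the most recent states are kept, so memory use stays constant
        # no matter how long the stream runs.
        self.memory: Deque[HandState] = deque(maxlen=memory_size)
        self.horizon = horizon
        self.instruction = ""

    def observe(self, state: HandState, instruction: str) -> List[HandState]:
        """Consume one new observation and return `horizon` future states."""
        self.instruction = instruction  # stored only; the toy ignores it
        self.memory.append(state)
        if len(self.memory) < 2:
            return []  # not enough context to estimate motion yet

        prev, last = self.memory[-2], self.memory[-1]
        forecast: List[HandState] = []
        for _ in range(self.horizon):
            # Constant-velocity assumption (a placeholder for a learned model).
            delta = (
                last.wrist_xyz[0] - prev.wrist_xyz[0],
                last.wrist_xyz[1] - prev.wrist_xyz[1],
                last.wrist_xyz[2] - prev.wrist_xyz[2],
            )
            nxt = HandState(
                hand_type=last.hand_type,
                bbox_2d=last.bbox_2d,  # left unchanged in this toy
                pose_3d=[_add(j, delta) for j in last.pose_3d],
                wrist_xyz=_add(last.wrist_xyz, delta),
            )
            forecast.append(nxt)
            # Autoregression: each prediction becomes input to the next step.
            prev, last = last, nxt
        return forecast
```

Calling `observe` once per incoming frame yields a fresh forecast at every step without access to the full video, which is the property that separates a streaming predictor from an offline one that needs the accumulated sequence.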

Code & Models

Repositories

ut-vision/SFHand (GitHub)

Models

ut-vision/SFHand (Hugging Face model)

Datasets

ut-vision/EgoHaFL (dataset, 214 downloads)

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Robot Manipulation and Learning