FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation

Huajian Zeng; Lingyun Chen; Jiaqi Yang; Yuantai Zhang; Fan Shi; Peidong Liu; Xingxing Zuo

arXiv:2602.13444·cs.RO·February 17, 2026

TL;DR

FlowHOI introduces a flow-matching framework that generates semantically grounded, temporally coherent hand-object interaction sequences for dexterous robot manipulation, improving transferability, accuracy, and speed.

Contribution

FlowHOI presents a novel two-stage flow-matching approach with a reconstruction pipeline for HOI generation, explicitly modeling hand-object interactions for robotic tasks.
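For readers unfamiliar with flow matching, the following is a minimal sketch of the generic technique, not the FlowHOI implementation. It assumes the common straight-line path between a noise sample and a data sample, and it replaces the learned velocity network with a closed-form toy field that pulls every sample toward a single hypothetical target vector. All names here (`interpolate`, `target_velocity`, `euler_sample`, `toy_velocity`, `target`) are illustrative and do not come from the paper.

```python
import numpy as np


def interpolate(x0, x1, t):
    """Point at time t on the straight-line path from noise x0 to data x1."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Regression target a velocity network would be trained on for this path."""
    return x1 - x0


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so the toy field below stays finite
        x = x + dt * velocity_fn(x, t)
    return x


# Hypothetical stand-in for one generated pose vector (assumption, toy data).
target = np.array([0.3, -1.2, 0.8])


def toy_velocity(x, t):
    """Closed-form velocity when the data distribution is the single point `target`.

    In a real system this would be a trained, conditioned neural network.
    """
    return (target - x) / (1.0 - t)


rng = np.random.default_rng(0)
noise = rng.standard_normal(3)
sample = euler_sample(toy_velocity, noise, num_steps=8)
```

In this toy setting a handful of Euler steps carries the noise sample onto the target. With a learned velocity field, the step count trades sample quality against inference time.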

Findings

01. Achieves the highest action recognition accuracy on the GRAB and HOT3D benchmarks.

02. Achieves a 1.7× higher physics simulation success rate than the diffusion baseline.

03. Runs inference 40× faster than previous methods.

Abstract

Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to…

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Multimodal Machine Learning Applications