HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching

Zerui Chen; Rolandos Alexandros Potamias; Shizhe Chen; Jiankang Deng; Cordelia Schmid; Stefanos Zafeiriou

arXiv:2604.10836·cs.CV·April 14, 2026


TL;DR

HO-Flow synthesizes realistic, temporally coherent 3D hand-object interaction sequences from text and canonical 3D objects, targeting motion generation for computer vision and robotics.

Contribution

HO-Flow introduces a unified latent representation of hand and object motion, learned by an interaction-aware variational autoencoder, and a masked flow matching model that improves generalization and temporal reasoning in hand-object interaction synthesis.

Findings

01 Achieves state-of-the-art results on the GRAB, OakInk, and DexYCB benchmarks.

02 Effectively models rich interaction dynamics and motion diversity.

03 Enhances generalization through relative object motion prediction.

Abstract

Generating realistic 3D hand-object interactions (HOI) is a fundamental challenge in computer vision and robotics, requiring both temporal coherence and high-fidelity physical plausibility. Existing methods remain limited in their ability to learn expressive motion representations for generation and to perform temporal reasoning. In this paper, we present HO-Flow, a framework for synthesizing realistic hand-object motion sequences from text and canonical 3D objects. HO-Flow first employs an interaction-aware variational autoencoder to encode sequences of hand and object motions into a unified latent manifold by incorporating hand and object kinematics, enabling the representation to capture rich interaction dynamics. It then leverages a masked flow matching model that combines auto-regressive temporal reasoning with continuous latent generation, improving temporal coherence. To further…
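For readers unfamiliar with flow matching, the following is a minimal, generic sketch of the idea on plain vectors, assuming a straight-line path between a noise sample and a data sample. It is not the HO-Flow implementation: it omits the masking, auto-regressive ordering, text and object conditioning, and the variational autoencoder described in the abstract, and every name in it (`fm_training_pair`, `fm_loss`, `euler_sample`, `velocity_fn`) is hypothetical.

```python
import numpy as np

def _expand_t(t, ndim):
    """Reshape a batch of times (B,) to (B, 1, ...) so it broadcasts over features."""
    t = np.asarray(t, dtype=float)
    return t.reshape(t.shape + (1,) * (ndim - t.ndim))

def fm_training_pair(x0, x1, t):
    """Straight-line flow matching (an assumed path, not taken from the paper).

    x0: noise samples, x1: data (here: latent) samples, t: times in [0, 1].
    Returns the interpolated point x_t and the regression target x1 - x0.
    """
    t = _expand_t(t, np.ndim(x0))
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0
    return x_t, u_t

def fm_loss(velocity_fn, x0, x1, t):
    """Mean squared error between a predicted velocity and the target velocity."""
    x_t, u_t = fm_training_pair(x0, x1, t)
    pred = velocity_fn(x_t, _expand_t(t, np.ndim(x0)))
    return float(np.mean((pred - u_t) ** 2))

def euler_sample(velocity_fn, x0, num_steps=8):
    """Generate a sample by integrating dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

# Toy check with a closed-form velocity field that pulls every point
# toward a single target vector (stands in for a trained network).
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])
noise = rng.standard_normal((4, 3))
oracle = lambda x, t: (target - x) / (1.0 - t)
generated = euler_sample(oracle, noise, num_steps=8)
```

In a real model, `velocity_fn` would be a learned network trained by minimizing `fm_loss` over random noise, data, and time samples. The closed-form `oracle` is used here only so the sketch runs without a training loop.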
