H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning

Zhanbo Huang; Xiaoming Liu; Yu Kong

arXiv:2605.22629·cs.CV·May 22, 2026


TL;DR

H-Flow is a self-supervised, physics-inspired multi-modal learning approach that estimates dense human scene flow from monocular video, capturing pose and surface deformation without requiring dense supervision.

Contribution

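H-Flow introduces a unified multi-head transformer that estimates dense human scene flow from monocular video while jointly predicting pose and depth. In place of dense labels, training is anchored in geometric, structural, and biomechanical priors expressed as cross-modal objectives. The paper also introduces DynAct4D, a high-fidelity synthetic benchmark with dense flow annotations.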

Findings

01 Outperforms existing scene-flow and parametric models on standard benchmarks.

02 Generalizes zero-shot to in-the-wild videos.

03 The DynAct4D benchmark provides dense flow annotations across diverse subjects, garments, and motions.

Abstract

Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense human scene flow that captures both skeletal kinematics and surface deformation. A unified multi-head transformer estimates flow from monocular video, jointly predicting pose and depth as companion outputs. The challenge lies in the lack of supervision. In place of unattainable labels, we anchor the network in the physics of human motion, encoding geometric, structural, and biomechanical priors as cross-modal training objectives. We further introduce DynAct4D, a high-fidelity synthetic benchmark providing dense flow annotations across diverse subjects, garments, and motions. On standard…
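To make the idea of a structural prior used as a training objective more concrete, here is a minimal sketch of one plausible term: a bone-length consistency penalty. This is an illustration only and is not taken from the paper. The function name `bone_length_loss`, the choice of penalty, and the toy skeleton are all assumptions. The premise is simply that bones are rigid, so moving two connected joints by their predicted 3D flow should leave the distance between them unchanged.

```python
import math

def bone_length_loss(joints, flow, bones):
    """Hypothetical structural prior (not the paper's loss).

    joints: list of (x, y, z) joint positions at time t
    flow:   list of (dx, dy, dz) predicted 3D flow sampled at each joint
    bones:  list of (parent_index, child_index) pairs
    Returns the mean absolute change in bone length after applying the flow.
    """
    # Advect every joint by its predicted flow vector.
    moved = [tuple(p + d for p, d in zip(joint, delta))
             for joint, delta in zip(joints, flow)]
    total = 0.0
    for parent, child in bones:
        before = math.dist(joints[parent], joints[child])
        after = math.dist(moved[parent], moved[child])
        total += abs(after - before)
    return total / len(bones)

# Toy three-joint chain along the y-axis (illustrative values only).
joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
bones = [(0, 1), (1, 2)]

rigid_flow = [(0.5, 0.0, 0.0)] * 3              # whole chain translates
stretch_flow = [(0.0, 0.0, 0.0),
                (0.0, 0.0, 0.0),
                (0.0, 1.0, 0.0)]                # last joint pulls away

print(bone_length_loss(joints, rigid_flow, bones))    # rigid motion: no penalty
print(bone_length_loss(joints, stretch_flow, bones))  # stretched bone: penalized
```

A term of this kind needs no flow labels: it only compares the predicted flow against the predicted pose, which is the sense in which such an objective can be called cross-modal. Surface deformation of clothing and soft tissue would not be constrained by it, so in practice it could only be one prior among several.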
