End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction
Haoyu Zhang, Wei Zhai, Yuhang Yang, Yang Cao, Zheng-Jun Zha

TL;DR
This paper introduces THO, an end-to-end spatial-temporal transformer that reconstructs 4D human-object interactions (HOI) from monocular video in real time. It runs at 31.5 FPS on a single GPU, an over 600x speedup over prior methods, while improving reconstruction accuracy and temporal consistency.
Contribution
The paper presents a novel end-to-end transformer architecture that leverages spatial-temporal priors for real-time 4D HOI reconstruction from monocular videos, avoiding the high latency and error accumulation of multi-stage and optimization-based pipelines.
Findings
Operates at 31.5 FPS on a single GPU
Achieves over 600x speedup compared to prior methods
Improves reconstruction accuracy and temporal consistency
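To put the two speed figures in perspective, the short calculation below converts them into per-frame latencies. It is a back-of-the-envelope sketch that assumes the 600x speedup is a ratio of throughput (frames per second). The summary does not state how the speedup was measured, so the baseline numbers are only an implied bound, not reported results.

```python
# Back-of-the-envelope latency arithmetic from the reported figures.
# Assumption: "600x speedup" means a 600x ratio in frames per second.

def frame_latency_ms(fps: float) -> float:
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps

def implied_baseline_fps(fps: float, speedup: float) -> float:
    """Throughput of a method that is `speedup` times slower."""
    return fps / speedup

tho_fps = 31.5      # reported throughput on a single GPU
speedup = 600.0     # reported lower bound ("over 600x")

tho_ms = frame_latency_ms(tho_fps)                  # about 31.7 ms per frame
base_fps = implied_baseline_fps(tho_fps, speedup)   # at most about 0.0525 FPS
base_s = 1.0 / base_fps                             # at least about 19 s per frame

print(f"THO: {tho_ms:.1f} ms/frame")
print(f"Implied baseline: <= {base_fps:.4f} FPS, >= {base_s:.1f} s/frame")
```

Under that assumption, prior methods would need roughly 19 seconds or more per frame, compared with about 32 milliseconds for THO.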
Abstract
Monocular 4D human-object interaction (HOI) reconstruction, which recovers a moving human and a manipulated object from a single RGB video, remains challenging due to depth ambiguity and frequent occlusions. Existing methods often rely on multi-stage pipelines or iterative optimization, which leads to high inference latency that fails to meet real-time requirements and makes them susceptible to error accumulation. To address these limitations, we propose THO, an end-to-end Spatial-Temporal Transformer that predicts human motion and coordinated object motion in a feed-forward fashion from the given video and 3D template. THO achieves this by leveraging spatial-temporal HOI tuple priors. Spatial priors exploit contact-region proximity to infer occluded object features from human cues, while temporal priors capture cross-frame kinematic correlations to refine object representations and enforce physical…
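The spatial and temporal priors described in the abstract can be pictured as two attention passes over object features: one across human tokens within each frame, and one across frames. The NumPy sketch below illustrates that generic pattern only. It is not THO's architecture: the function names, token counts, residual connections, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention; q: (..., Nq, D), k and v: (..., Nk, D)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def spatial_then_temporal(human, obj):
    """Hypothetical two-pass refinement of object tokens.

    human: (T, Nh, D) human tokens per frame
    obj:   (T, No, D) object tokens per frame
    """
    # Spatial pass: within each frame, object tokens attend to human tokens,
    # so human cues can fill in object features (e.g. under occlusion).
    obj = obj + attention(obj, human, human)          # (T, No, D)
    # Temporal pass: each object token attends to its own history across frames.
    obj_t = np.swapaxes(obj, 0, 1)                    # (No, T, D)
    obj_t = obj_t + attention(obj_t, obj_t, obj_t)    # (No, T, D)
    return np.swapaxes(obj_t, 0, 1)                   # (T, No, D)

# Toy sizes (assumed): 8 frames, 24 human tokens, 4 object tokens, 16 channels.
rng = np.random.default_rng(0)
T, Nh, No, D = 8, 24, 4, 16
human = rng.standard_normal((T, Nh, D))
obj = rng.standard_normal((T, No, D))
out = spatial_then_temporal(human, obj)
print(out.shape)
```

Both passes run as a single forward computation with no iterative optimization, which is the property the abstract contrasts with multi-stage pipelines.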
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation
