End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction

Haoyu Zhang; Wei Zhai; Yuhang Yang; Yang Cao; Zheng-Jun Zha

arXiv:2603.14435 · cs.CV · March 17, 2026

TL;DR

This paper introduces THO, an end-to-end spatial-temporal transformer that reconstructs 4D human-object interactions (HOI) from monocular video in real time and is reported to outperform prior methods in both speed and accuracy.

Contribution

The paper presents an end-to-end transformer architecture that uses spatial-temporal priors for real-time 4D HOI reconstruction from monocular video, addressing the high inference latency and error accumulation of earlier multi-stage and optimization-based pipelines.

Findings

01 Operates at 31.5 FPS on a single GPU

02 Achieves over 600x speedup compared to prior methods

03 Improves reconstruction accuracy and temporal consistency
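
To put the first two findings in per-frame terms, the short calculation below converts the reported throughput into latency and derives what a 600x speedup would imply for a baseline. It assumes the speedup is a ratio of per-frame processing time on comparable hardware, which this page does not state; the variable names are illustrative only.

```python
# Reported figures from the findings above.
fps = 31.5          # frames per second on a single GPU
speedup = 600.0     # "over 600x", so treat this as a lower bound

# Per-frame latency of the reported method, in milliseconds.
latency_ms = 1000.0 / fps

# ASSUMPTION: the speedup compares per-frame processing time.
# Under that reading, a prior method would need at least this
# many seconds per frame, i.e. run at no more than this many FPS.
baseline_s_per_frame_min = speedup / fps
baseline_fps_max = fps / speedup

print(f"latency per frame: {latency_ms:.2f} ms")
print(f"implied baseline: >= {baseline_s_per_frame_min:.2f} s/frame "
      f"(<= {baseline_fps_max:.4f} FPS)")
```

Under that assumption, 31.5 FPS corresponds to roughly 32 ms per frame, and a baseline more than 600x slower would take about 19 s or more per frame.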

Abstract

Monocular 4D human-object interaction (HOI) reconstruction, which recovers a moving human and a manipulated object from a single RGB video, remains challenging due to depth ambiguity and frequent occlusions. Existing methods often rely on multi-stage pipelines or iterative optimization, which leads to high inference latency that fails to meet real-time requirements and makes them susceptible to error accumulation. To address these limitations, we propose THO, an end-to-end Spatial-Temporal Transformer that predicts human motion and coordinated object motion in a feed-forward fashion from the given video and 3D template. THO achieves this by leveraging spatial-temporal HOI tuple priors. Spatial priors exploit contact-region proximity to infer occluded object features from human cues, while temporal priors capture cross-frame kinematic correlations to refine object representations and enforce physical…
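
The abstract's two priors can be illustrated with a generic attention sketch: object tokens first attend to human tokens within each frame (a stand-in for the spatial prior), then each object token attends to itself across frames (a stand-in for the temporal prior). This is not the THO architecture, whose details are not given on this page. The function names, tensor shapes, residual connections, and the absence of learned projections are all assumptions made to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Plain scaled dot-product attention (no learned weights).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def spatial_then_temporal(obj, hum):
    """Hypothetical helper. obj: (T, No, D) object tokens,
    hum: (T, Nh, D) human tokens, for T frames."""
    # Spatial step: within each frame, object tokens query human tokens,
    # so object features can borrow information from human cues.
    obj = obj + attend(obj, hum, hum)            # (T, No, D)
    # Temporal step: each object token attends across all frames.
    seq = np.swapaxes(obj, 0, 1)                 # (No, T, D)
    seq = seq + attend(seq, seq, seq)            # (No, T, D)
    return np.swapaxes(seq, 0, 1)                # (T, No, D)

rng = np.random.default_rng(0)
T, No, Nh, D = 8, 4, 24, 16                      # illustrative sizes
obj_tokens = rng.normal(size=(T, No, D))
hum_tokens = rng.normal(size=(T, Nh, D))
refined = spatial_then_temporal(obj_tokens, hum_tokens)
print(refined.shape)
```

The output keeps the object-token shape `(T, No, D)`, so a block like this could be stacked or followed by a regression head; a real model would add learned query, key, and value projections, multiple heads, and normalization.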

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Human Motion and Animation