TL;DR
GRAFT is a transformer-based method that efficiently refines 3D human-scene interaction reconstructions from a single image, combining accuracy and speed.
Contribution
It introduces a learned prior that predicts interaction gradients, enabling fast, iterative refinement of human meshes with scene reasoning, applicable as an end-to-end or plug-and-play approach.
Findings
GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods.
It matches optimization-based methods' quality at approximately 50 times lower runtime.
It generalizes well to in-the-wild multi-person scenes and is preferred 64.8% of the time in a user study.
Abstract
Reconstructing physically plausible 3D human-scene interactions (HSI) from a single image currently presents a trade-off: optimization-based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human-scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight…
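The refinement loop described in the abstract (probe the scene around the body, predict a corrective update, apply it, repeat) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not GRAFT's implementation: the scene is a flat floor at a fixed height, the "probe" is just each point's signed height above that floor, and `toy_update` is a hand-written rule standing in for the learned transformer, which in the paper predicts updates to human mesh parameters rather than a single vertical shift.

```python
from typing import Callable, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def probe_ground(points: Sequence[Vec3], ground_z: float = 0.0) -> List[float]:
    """Toy 'geometric probe': signed height of each body point above a flat floor.

    Positive means the point floats above the floor, negative means it penetrates.
    """
    return [p[2] - ground_z for p in points]


def toy_update(probes: Sequence[float], step: float = 0.5) -> float:
    """Hypothetical stand-in for a learned update predictor.

    Returns a vertical correction that moves the lowest body point a fraction
    `step` of the way toward contact with the floor.
    """
    lowest = min(probes)
    return -step * lowest


def refine(
    points: Sequence[Vec3],
    n_iters: int = 10,
    predictor: Callable[[Sequence[float]], float] = toy_update,
) -> List[Vec3]:
    """Iteratively apply predicted corrections to the whole body."""
    pts = list(points)
    for _ in range(n_iters):
        dz = predictor(probe_ground(pts))          # re-probe the current state
        pts = [(x, y, z + dz) for x, y, z in pts]  # apply the corrective update
    return pts


if __name__ == "__main__":
    floating_body = [(0.0, 0.0, 0.2), (0.0, 0.0, 1.0)]      # feet 20 cm above floor
    penetrating_body = [(0.0, 0.0, -0.1), (0.0, 0.0, 0.7)]  # feet 10 cm below floor
    print(min(p[2] for p in refine(floating_body)))
    print(min(p[2] for p in refine(penetrating_body)))
```

Both the floating and the penetrating starting poses converge toward floor contact, because each iteration re-reads the probes from the current state instead of applying one fixed correction. The toy shift is rigid, so the body's shape is unchanged; a learned predictor operating on pose parameters would not have that restriction.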
