InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll

TL;DR
InHabit is a scalable, fully automatic method that uses image foundation models to generate large-scale, photorealistic 3D human-scene interaction datasets, which in turn improve downstream 3D understanding tasks.
Contribution
We introduce InHabit, a novel render-generate-lift pipeline that creates extensive 3D human-scene interaction data from scene renderings using vision-language models.
Findings
InHabit produced 78K samples across 800 scenes, with detailed 3D geometry and human models.
Augmenting training data with InHabit samples improved 3D human-scene reconstruction and contact estimation.
In a user study, InHabit data was preferred over the state of the art in 78% of cases.
Abstract
Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into…
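The render-generate-lift principle described above can be read as a three-stage composition. The following is only a minimal sketch of that control flow, not the authors' implementation: every name in it (`render_generate_lift`, `Sample`, and the four stage callables) is hypothetical, the stages are replaced by string-based stubs, and the output of the lifting stage is left opaque because the abstract is truncated at that point.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Sample:
    """One generated human-scene interaction (hypothetical container)."""
    scene_id: str
    action: str
    lifted: Any  # result of the lifting stage; its exact form is not assumed here


def render_generate_lift(
    scene_id: str,
    render: Callable[[str], Any],             # 3D scene -> rendered image
    propose_actions: Callable[[Any], List[str]],  # stands in for the vision-language model
    insert_human: Callable[[Any, str], Any],  # stands in for the image-editing model
    lift: Callable[[str, Any], Any],          # stands in for the optimization procedure
) -> List[Sample]:
    """Run the three stages once per proposed action for a single scene."""
    image = render(scene_id)
    samples = []
    for action in propose_actions(image):
        edited = insert_human(image, action)
        samples.append(Sample(scene_id, action, lift(scene_id, edited)))
    return samples


# Usage with placeholder stages (strings instead of images and meshes).
samples = render_generate_lift(
    "scene_0",
    render=lambda s: f"render({s})",
    propose_actions=lambda img: ["sit on the sofa", "open the fridge"],
    insert_human=lambda img, a: f"{img}+human[{a}]",
    lift=lambda s, e: {"source": e},
)
for sample in samples:
    print(sample.action, "->", sample.lifted)
```

Passing the stages in as callables keeps the sketch independent of any particular model or renderer; in a real system each stub would be swapped for the corresponding component.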
