Reconstructing Action-Conditioned Human-Object Interactions Using Commonsense Knowledge Priors
Xi Wang, Gen Li, Yen-Ling Kuo, Muhammed Kocabas, Emre Aksan, Otmar Hilliges

TL;DR
This paper introduces a novel method that leverages commonsense knowledge from large language models to infer diverse 3D human-object interactions from single images, improving reconstruction accuracy without requiring contact or scene supervision.
Contribution
It proposes an action-conditioned approach that uses language model priors for 3D reasoning of human-object interactions, enabling generalization across categories and interaction types.
Findings
Improved 3D reconstruction accuracy on a large dataset.
Effective qualitative results on real images.
Demonstrated generalization to diverse objects and interactions.
Abstract
We present a method for inferring diverse 3D models of human-object interactions from images. Reasoning about how humans interact with objects in complex scenes from a single 2D image is a challenging task, given the ambiguities that arise from the loss of information through projection. In addition, modeling 3D interactions requires the ability to generalize across diverse object categories and interaction types. We propose an action-conditioned modeling of interactions that allows us to infer diverse 3D arrangements of humans and objects without supervision on contact regions or 3D scene geometry. Our method extracts high-level commonsense knowledge from large language models (such as GPT-3) and applies it to perform 3D reasoning of human-object interactions. Our key insight is that priors extracted from large language models can help in reasoning about human-object contacts from textural…
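The idea of querying a language model for contact priors, conditioned on an action and an object, might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the body-part vocabulary, the prompt wording, and the helper names `build_prompt` and `parse_contacts` are all hypothetical, and the model reply is a hard-coded stand-in rather than a real API call.

```python
import re

# Hypothetical vocabulary of coarse body-part labels (an assumption, not from the paper).
BODY_PARTS = ("head", "back", "hands", "arms", "hips", "thighs", "legs", "feet")


def build_prompt(action: str, obj: str) -> str:
    """Compose a text prompt asking which body parts touch the object."""
    parts = ", ".join(BODY_PARTS)
    return (
        f"A person is {action} a {obj}. "
        f"Which of these body parts touch the {obj}: {parts}? "
        "Answer with a comma-separated list."
    )


def parse_contacts(answer: str) -> list[str]:
    """Keep only known body-part labels found in the answer, in vocabulary order."""
    words = set(re.findall(r"[a-z]+", answer.lower()))
    return [part for part in BODY_PARTS if part in words]


# Stand-in for a language model reply; a real system would send the prompt to a model.
prompt = build_prompt("sitting on", "chair")
example_answer = "Hips, thighs, and back."
contacts = parse_contacts(example_answer)
```

In a full pipeline, the parsed labels could be mapped to regions of a body mesh and used as a soft contact constraint when fitting the 3D human and object; that step is outside the scope of this sketch.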
