TL;DR
This paper introduces COINS, a transformer-based generative model that synthesizes realistic interactions between humans and 3D scenes under semantic control. It can compose multiple atomic interactions without requiring composite interaction data.
Contribution
The paper presents a novel joint representation and a transformer-based model for compositional human-scene interaction synthesis with semantic control. It also extends existing datasets and outperforms existing methods.
Findings
Generated interactions are realistic and semantically controlled.
The model handles multiple atomic interactions simultaneously.
The model outperforms existing methods in perceptual studies.
Abstract
Synthesizing natural interactions between virtual humans and their 3D environments is critical for numerous applications, such as computer games and AR/VR experiences. Our goal is to synthesize humans interacting with a given 3D scene controlled by high-level semantic specifications as pairs of action categories and object instances, e.g., "sit on the chair". The key challenge of incorporating interaction semantics into the generation framework is to learn a joint representation that effectively captures heterogeneous information, including human body articulation, 3D object geometry, and the intent of the interaction. To address this challenge, we design a novel transformer-based generative model, in which the articulated 3D human body surface points and 3D objects are jointly encoded in a unified latent space, and the semantics of the interaction between the human and objects are…
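The abstract describes the control signal as pairs of action categories and object instances. As a rough illustration only, such a specification could be represented as below. This is a minimal sketch: the names `AtomicInteraction`, `compose`, and `describe`, and the "touch the table" example, are hypothetical and are not taken from the COINS code or paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AtomicInteraction:
    """One (action category, object instance) pair, e.g. ("sit on", "chair").

    Hypothetical container for illustration; not the authors' data format.
    """
    action: str
    object_instance: str


def compose(*atoms: AtomicInteraction) -> Tuple[AtomicInteraction, ...]:
    """Combine atomic interactions into one composite specification.

    Duplicates are dropped and the original order is preserved.
    """
    seen = set()
    ordered = []
    for atom in atoms:
        if atom not in seen:
            seen.add(atom)
            ordered.append(atom)
    return tuple(ordered)


def describe(spec: Tuple[AtomicInteraction, ...]) -> str:
    """Render a specification as a readable instruction string."""
    return " and ".join(f"{a.action} the {a.object_instance}" for a in spec)


sit = AtomicInteraction("sit on", "chair")
touch = AtomicInteraction("touch", "table")
spec = compose(sit, touch, sit)  # the repeated "sit" pair is dropped
print(describe(spec))  # prints "sit on the chair and touch the table"
```

The sketch covers only the semantic input side. How the model encodes body surface points, object geometry, and these semantics is not shown here.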
