CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images
Sookwan Han, Hanbyul Joo

TL;DR
This paper introduces CHORUS, a self-supervised method that learns 3D human-object spatial relations from images synthesized by a text-to-image model. Because it needs no 3D interaction annotations, which are difficult to collect and hard to scale, the approach can cover a wide range of interactions.
Contribution
CHORUS is presented as the first method to use a generative image model for learning 3D human-object spatial relations, and it proposes a complete framework for 3D reasoning from synthesized 2D images.
Findings
Synthesized images are sufficient for learning 3D spatial relations.
The method effectively disambiguates interaction types via semantic clustering.
The paper introduces a new metric for evaluating the quality of the learned 3D spatial relations.
Abstract
We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D in a self-supervised way. This is a challenging task, as there exist specific manifolds of the interactions that can be considered human-like and natural, but the human pose and the geometry of objects can vary even for similar interactions. Such diversity makes the task of annotating 3D interactions difficult and hard to scale, which limits the potential to reason about them in a supervised way. One way of learning the 3D spatial relationship between humans and objects during interaction is by showing multiple 2D images captured from different viewpoints when humans interact with the same type of objects. The core idea of our method is to leverage a generative model that produces high-quality 2D images from an arbitrary text prompt input as an…
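The multi-view idea in the abstract can be illustrated with a toy example. The sketch below is not the CHORUS pipeline: it is a generic visual-hull-style voting scheme, written under several assumptions that do not come from the paper (orthographic cameras, known camera rotations about the vertical axis, binary object masks, and a fixed voxel grid in a shared human-centered frame). All function names are hypothetical. It shows only how 2D object masks seen from different viewpoints can be accumulated into a 3D occupancy estimate.

```python
import numpy as np

def make_grid(n=16, extent=1.0):
    """Voxel centers of an n^3 grid spanning [-extent, extent]^3."""
    axis = np.linspace(-extent, extent, n)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def rotation_y(theta):
    """Rotation about the vertical (y) axis; stands in for a camera viewpoint."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project_orthographic(points, R, res, extent=1.0):
    """Rotate points into the camera frame, drop depth, map to pixel indices."""
    cam = points @ R.T
    uv = (cam[:, :2] + extent) / (2 * extent) * (res - 1)
    return np.rint(uv).astype(int)

def render_sphere_mask(center, radius, R, res, extent=1.0):
    """Toy data only: the orthographic silhouette of a sphere 'object'."""
    c = R @ center
    axis = np.linspace(-extent, extent, res)
    u, v = np.meshgrid(axis, axis)  # u varies along columns, v along rows
    return ((u - c[0]) ** 2 + (v - c[1]) ** 2 <= radius ** 2).astype(float)

def aggregate_occupancy(masks, rotations, grid, extent=1.0):
    """Fraction of views in which each voxel projects inside the object mask."""
    votes = np.zeros(len(grid))
    for mask, R in zip(masks, rotations):
        res = mask.shape[0]
        uv = project_orthographic(grid, R, res, extent)
        # Voxels that fall outside the image receive no vote from this view.
        inside = np.all((uv >= 0) & (uv < res), axis=1)
        hit = np.zeros(len(grid))
        hit[inside] = mask[uv[inside, 1], uv[inside, 0]]
        votes += hit
    return votes / len(masks)

# Usage: eight synthetic views of an object placed beside the origin.
grid = make_grid(16)
center = np.array([0.3, 0.0, 0.0])
rotations = [rotation_y(t) for t in np.linspace(0, 2 * np.pi, 8, endpoint=False)]
masks = [render_sphere_mask(center, 0.25, R, 64) for R in rotations]
occupancy = aggregate_occupancy(masks, rotations, grid)
```

Voxels that every view agrees on get an occupancy of 1, and they cluster around the true object location. In a realistic setting the viewpoints are not given and the images are not consistent renderings of one scene, which is the harder problem the paper addresses.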
Videos
CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images (YouTube)
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
