Finding 3D Scene Analogies with Multimodal Foundation Models
Junho Kim, Young Min Kim

TL;DR
This paper introduces a zero-shot, open-vocabulary method using multimodal foundation models to find 3D scene analogies, enabling transfer of trajectories and waypoints across complex scenes without additional training.
Contribution
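The work presents a hybrid neural scene representation, a sparse graph built from vision-language model features paired with a feature field derived from 3D shape foundation models, for finding 3D scene analogies in a zero-shot, open-vocabulary setting.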
Findings
Accurately establishes correspondences between complex 3D scenes.
Enables transfer of trajectories and waypoints across scenes.
Operates without additional training or fixed vocabularies.
Abstract
Connecting current observations with prior experiences helps robots adapt and plan in new, unseen 3D environments. Recently, 3D scene analogies have been proposed to connect two 3D scenes; these are smooth maps that align scene regions with common spatial relationships. These maps enable detailed transfer of trajectories or waypoints, potentially supporting demonstration transfer for imitation learning or task plan transfer across scenes. However, existing methods for the task require additional training and fixed object vocabularies. In this work, we propose to use multimodal foundation models for finding 3D scene analogies in a zero-shot, open-vocabulary setting. Central to our approach is a hybrid neural representation of scenes that consists of a sparse graph based on vision-language model features and a feature field derived from 3D shape foundation models. 3D scene analogies are…
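To make the idea of transferring waypoints through a smooth map concrete, here is a minimal sketch. It is not the paper's method: it assumes a handful of corresponding keypoints between two scenes are already known, and it swaps in a plain least-squares affine map as the simplest possible smooth map. The function names `fit_affine_map` and `transfer_waypoints` and all coordinates are hypothetical, chosen only for illustration.

```python
import numpy as np


def fit_affine_map(src_pts, dst_pts):
    """Fit an affine map (assumed stand-in for a scene analogy) by least squares.

    src_pts, dst_pts: (N, 3) arrays of corresponding keypoints in the two scenes.
    Returns a (4, 3) parameter matrix acting on homogeneous coordinates.
    """
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    # Append a column of ones so one solve recovers the linear part and the translation.
    src_h = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return params


def transfer_waypoints(params, waypoints):
    """Apply the fitted map to (M, 3) waypoints from the source scene."""
    wp = np.asarray(waypoints, dtype=float)
    wp_h = np.hstack([wp, np.ones((len(wp), 1))])
    return wp_h @ params


# Hypothetical correspondences: the target scene is the source scene
# scaled by 2 and shifted by 1 along x.
src_keypoints = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
dst_keypoints = 2.0 * src_keypoints + np.array([1.0, 0.0, 0.0])

params = fit_affine_map(src_keypoints, dst_keypoints)

# A short trajectory recorded in the source scene, mapped into the target scene.
trajectory = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.0], [1.0, 1.0, 0.0]])
transferred = transfer_waypoints(params, trajectory)
```

An affine map can only capture global scale, rotation, shear, and translation. The maps described in the abstract align regions by their spatial relationships and would in general be non-rigid, so this sketch shows only the transfer step, not how the correspondences or the map are found.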
