Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement
Walter Goodwin, Sagar Vaze, Ioannis Havoutis, Ingmar Posner

TL;DR
This paper introduces a novel object matching method using a large pre-trained vision-language model to improve robustness in robotic scene rearrangement, especially when source and goal images differ in object instances.
Contribution
The work presents a new cross-instance object matching approach leveraging semantics and visual features, overcoming limitations of previous methods that required identical object instances.
Findings
Significantly improved matching performance in cross-instance scenarios
Enables robot manipulation from goal images with no shared object instances
Demonstrates robustness to increased visual scene shifts
Abstract
Object rearrangement has recently emerged as a key competency in robot manipulation, with practical solutions generally involving object detection, recognition, grasping and high-level planning. Goal-images describing a desired scene configuration are a promising and increasingly used mode of instruction. A key outstanding challenge is the accurate inference of matches between objects in front of a robot, and those seen in a provided goal image, where recent works have struggled in the absence of object-specific training data. In this work, we explore the deterioration of existing methods' ability to infer matches between objects as the visual shift between observed and goal scenes increases. We find that a fundamental limitation of the current setting is that source and target images must contain the same instance of every object, which restricts practical deployment. We…
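The matching problem described in the abstract can be illustrated with a small sketch. This is not the authors' method: it assumes each object has already been reduced to a feature vector (in practice these might come from a pre-trained vision-language model, but here they are made-up toy values), scores every source/goal pair by cosine similarity, and picks the one-to-one assignment with the highest total similarity by exhaustive search. The names `cosine` and `match_objects` are hypothetical, and the brute-force search is only practical for a handful of objects.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_objects(source_feats, goal_feats):
    """Return, for each source object, the index of its matched goal object.

    Assumes both scenes contain the same number of objects and searches all
    one-to-one assignments for the one with the highest total similarity.
    """
    n = len(source_feats)
    if n != len(goal_feats):
        raise ValueError("this sketch assumes equal object counts")
    sim = [[cosine(s, g) for g in goal_feats] for s in source_feats]
    best = max(
        permutations(range(n)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n)),
    )
    return list(best)


# Toy features (illustrative values only, not real model embeddings).
source = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
goal = [[0.0, 0.9, 0.1], [0.2, 0.1, 1.0], [0.9, 0.0, 0.1]]
matches = match_objects(source, goal)
print(matches)
```

Because similarity is computed on features and not on pixel-level appearance, the same procedure applies whether or not the two scenes share object instances, provided the features place semantically similar objects close together. For larger scenes, a polynomial-time assignment solver would replace the exhaustive search.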
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
