Open-Vocabulary Mobile Manipulation Based on Double Relaxed Contrastive Learning with Dense Labeling
Daichi Yashima, Ryosuke Korekata, Komei Sugiura

TL;DR
This paper introduces RelaX-Former, a novel contrastive learning approach for open-vocabulary object retrieval in domestic service robots, enabling accurate object handling based on complex natural language instructions.
Contribution
We propose RelaX-Former, a new contrastive learning method that improves image retrieval for robot manipulation using diverse positive and negative samples, enhancing zero-shot performance.
Findings
RelaX-Former outperforms baseline models on indoor image retrieval metrics.
Achieved a 75% success rate in real-world robot object transfer tasks.
Effective in zero-shot transfer scenarios with complex instructions.
Abstract
Growing labor shortages are increasing the demand for domestic service robots (DSRs) to assist in various settings. In this study, we develop a DSR that transports everyday objects to specified pieces of furniture based on open-vocabulary instructions. Our approach focuses on retrieving images of target objects and receptacles from pre-collected images of indoor environments. For example, given an instruction "Please get the right red towel hanging on the metal towel rack and put it in the white washing machine on the left," the DSR is expected to carry the red towel to the washing machine based on the retrieved images. This is challenging because the correct images should be retrieved from thousands of collected images, which may include many images of similar towels and appliances. To address this, we propose RelaX-Former, which learns diverse and robust representations from among…
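The retrieval step described above, picking the best-matching images out of a pre-collected pool for a given instruction, can be illustrated with a generic embedding-ranking sketch. This is not RelaX-Former itself: the function names (`cosine_similarity`, `rank_images`), the three-dimensional toy vectors, and the use of plain cosine similarity are all assumptions made for illustration, and in practice the embeddings would come from learned text and image encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(instruction_embedding, image_embeddings, top_k=3):
    """Return indices of the top_k images most similar to the instruction.

    Hypothetical helper: both arguments are assumed to be embeddings that
    already live in a shared text-image space.
    """
    scored = [
        (cosine_similarity(instruction_embedding, emb), idx)
        for idx, emb in enumerate(image_embeddings)
    ]
    # Highest similarity first; ties are broken by the lower image index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]


# Toy data standing in for an instruction and three pre-collected images.
instruction_embedding = [1.0, 0.0, 0.0]
image_embeddings = [
    [0.0, 1.0, 0.0],  # unrelated image
    [0.9, 0.1, 0.0],  # close match
    [0.5, 0.5, 0.0],  # partial match
]
top_matches = rank_images(instruction_embedding, image_embeddings, top_k=2)
print(top_matches)
```

With thousands of collected images, the same ranking would normally be done with batched matrix operations or an approximate nearest-neighbor index rather than a Python loop.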
Taxonomy
Topics: Speech and dialogue systems · Multimodal Machine Learning Applications
