Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations
Miyu Goko, Motonari Kambara, Daichi Saito, Seitaro Otsuki, Komei Sugiura

TL;DR
This paper introduces Contrastive λ-Repformer, a multi-level aligned representation model that predicts task success in open-vocabulary manipulation by capturing object details and changes between the images taken before and after manipulation, outperforming existing multimodal large language models (MLLMs).
Contribution
The paper proposes a new contrastive multi-level aligned representation model that improves task success prediction in manipulation tasks by integrating diverse features and focusing on important changes.
Findings
Outperforms existing multimodal large language models in accuracy.
Achieves an 8.66-point improvement over MLLMs on the RT-1 dataset.
Effective in real robot manipulation scenarios.
Abstract
In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive λ-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the…
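The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, feature sizes, random placeholder features, and the single logistic layer are all assumptions made for illustration. The actual model presumably uses learned image and text encoders. The sketch only shows the idea of building one multi-level representation per image from three feature types, taking the difference between the "after" and "before" representations, and scoring that difference together with an instruction embedding.

```python
import math
import random


def multi_level_representation(local_feat, aligned_feat, structured_feat):
    """Concatenate three per-image feature vectors into one representation.

    The three inputs stand in for (hypothetical) local image features,
    language-aligned features, and language-structured features.
    """
    return list(local_feat) + list(aligned_feat) + list(structured_feat)


def difference_representation(before, after):
    """Element-wise difference between the 'after' and 'before' representations."""
    return [a - b for a, b in zip(after, before)]


def predict_success(diff, instruction_feat, weights, bias=0.0):
    """Toy logistic scorer over the difference and an instruction embedding.

    Stands in for whatever learned head the real model uses (an assumption).
    """
    x = list(diff) + list(instruction_feat)
    logit = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Placeholder features; a real system would obtain these from encoders.
rng = random.Random(0)


def rand_vec(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


before = multi_level_representation(rand_vec(4), rand_vec(4), rand_vec(4))
after = multi_level_representation(rand_vec(4), rand_vec(4), rand_vec(4))
instruction = rand_vec(4)

diff = difference_representation(before, after)
weights = [rng.uniform(-1.0, 1.0) for _ in range(len(diff) + len(instruction))]
prob = predict_success(diff, instruction, weights)
print(f"predicted success probability: {prob:.3f}")
```

Working on the difference rather than on the two images separately means that anything unchanged between the two images cancels out, which is one plausible reading of how the model is meant to focus on important changes.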
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning
Methods: Focus
