Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations

Miyu Goko; Motonari Kambara; Daichi Saito; Seitaro Otsuki; Komei Sugiura

arXiv:2410.00436 · cs.RO · October 2, 2024

TL;DR

This paper introduces Contrastive λ-Repformer, a model that builds a multi-level aligned representation to predict task success in open-vocabulary manipulation. It captures detailed object characteristics and subtle changes in object position, and it outperforms existing approaches, including multimodal large language models (MLLMs).

Contribution

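The paper proposes Contrastive λ-Repformer, which improves task success prediction for table-top manipulation by integrating three types of features (features that preserve local image information, features aligned with natural language, and features structured through natural language) into a multi-level aligned representation that lets the model focus on important changes.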

Findings

01. Outperforms existing multimodal large language models in accuracy.

02. Achieves an 8.66-point improvement over MLLMs on the RT-1 dataset.

03. Effective in real robot manipulation scenarios.

Abstract

In this study, we consider the problem of predicting task success for open-vocabulary manipulation by a manipulator, based on instruction sentences and egocentric images before and after manipulation. Conventional approaches, including multimodal large language models (MLLMs), often fail to appropriately understand detailed characteristics of objects and/or subtle changes in the position of objects. We propose Contrastive $\lambda$-Repformer, which predicts task success for table-top manipulation tasks by aligning images with instruction sentences. Our method integrates the following three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on important changes by looking at the differences in the…
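The idea of combining three feature types per image and then attending to what changed can be illustrated with a minimal sketch. This is not the authors' architecture: the function names, the plain concatenation, the element-wise subtraction, and the logistic scoring head are all hypothetical stand-ins chosen for illustration, and the assumption that the difference is taken between the before and after representations is ours, since the abstract is truncated at that point.

```python
import math
from typing import List, Sequence

Vector = List[float]


def multi_level_representation(
    local_feat: Sequence[float],
    aligned_feat: Sequence[float],
    structured_feat: Sequence[float],
) -> Vector:
    """Join three per-image feature vectors into one vector.

    Plain concatenation is an assumption made for this sketch.
    """
    return [*local_feat, *aligned_feat, *structured_feat]


def representation_difference(before: Sequence[float], after: Sequence[float]) -> Vector:
    """Element-wise change from the 'before' to the 'after' representation."""
    if len(before) != len(after):
        raise ValueError("before and after representations must have equal length")
    return [a - b for a, b in zip(after, before)]


def success_probability(
    diff: Sequence[float],
    instruction_feat: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Toy logistic head over [diff; instruction]; a stand-in for a learned model."""
    x = [*diff, *instruction_feat]
    if len(x) != len(weights):
        raise ValueError("weights must match the combined feature length")
    logit = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical feature values for the images before and after manipulation.
before = multi_level_representation([0.0, 0.25], [0.5], [1.0])
after = multi_level_representation([0.0, 0.75], [1.0], [1.0])
diff = representation_difference(before, after)

# Hypothetical instruction embedding and untrained (hand-picked) weights.
instruction = [1.0]
prob = success_probability(diff, instruction, weights=[0.0, 1.0, 1.0, 0.0, 0.5])
print(f"changed dimensions: {[i for i, d in enumerate(diff) if d != 0.0]}")
print(f"toy success probability: {prob:.3f}")
```

In this toy setup, dimensions that do not change between the two images contribute zero to the difference vector, so the scoring head only reacts to what moved. A real model would learn the feature extractors and the head jointly rather than use fixed weights.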

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning

Methods: Focus