ClevrSkills: Compositional Language and Visual Reasoning in Robotics

Sanjay Haresh; Daniel Dijkman; Apratim Bhattacharyya; Roland Memisevic

arXiv:2411.09052 · cs.RO · November 15, 2024

TL;DR

ClevrSkills is a new benchmark suite designed to evaluate the ability of vision-language models to perform compositional reasoning in robotics tasks, highlighting current models' limitations in this area.

Contribution

The paper introduces ClevrSkills, an environment suite and dataset for testing compositional reasoning in robotics, and benchmarks existing models on it, exposing their shortcomings.

Findings

01. VLMs struggle with compositional reasoning in robotics tasks.

02. Pre-training on large datasets does not ensure success in complex reasoning.

03. ClevrSkills provides a structured curriculum for evaluating reasoning capabilities.

Abstract

Robotics tasks are highly compositional by nature. For example, to perform a high-level task like cleaning the table, a robot must employ low-level capabilities: moving the effectors to the objects on the table, picking them up, and then moving them off the table one by one, while re-evaluating the resulting dynamic scene in the process. Given that large vision language models (VLMs) have shown progress on many tasks that require high-level, human-like reasoning, we ask the question: if the models are taught the requisite low-level capabilities, can they compose them in novel ways to achieve interesting high-level tasks like cleaning the table without having to be explicitly taught to do so? To this end, we present ClevrSkills, a benchmark suite for compositional reasoning in robotics. ClevrSkills is an environment suite developed on top of the ManiSkill2 simulator and an accompanying…
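
The table-cleaning example in the abstract can be illustrated with a toy sketch of how a high-level task decomposes into low-level skills. This is not the ClevrSkills or ManiSkill2 API: the `TableScene` class and the `move_to`, `pick`, `move_off_table`, and `clear_table` functions are hypothetical names invented for illustration, and the "scene" is just a pair of Python lists rather than a simulated environment.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TableScene:
    """Toy stand-in for a scene (hypothetical, not a simulator state)."""
    on_table: List[str]
    off_table: List[str] = field(default_factory=list)
    holding: Optional[str] = None
    effector_at: Optional[str] = None


# --- Low-level skills (hypothetical names) ---

def move_to(scene: TableScene, obj: str) -> None:
    """Move the end effector to the named object."""
    scene.effector_at = obj


def pick(scene: TableScene, obj: str) -> None:
    """Grasp an object; the effector must already be positioned at it."""
    if scene.effector_at != obj or obj not in scene.on_table:
        raise ValueError(f"cannot pick {obj!r} from the current state")
    scene.on_table.remove(obj)
    scene.holding = obj


def move_off_table(scene: TableScene) -> None:
    """Carry the held object off the table and release it."""
    if scene.holding is None:
        raise ValueError("nothing is being held")
    scene.off_table.append(scene.holding)
    scene.holding = None
    scene.effector_at = None


# --- High-level task composed from the skills above ---

def clear_table(scene: TableScene) -> List[Tuple[str, str]]:
    """Clear the table one object at a time, re-reading the scene each pass."""
    trace: List[Tuple[str, str]] = []
    while scene.on_table:  # re-evaluate what is left after every object
        obj = scene.on_table[0]
        move_to(scene, obj)
        trace.append(("move_to", obj))
        pick(scene, obj)
        trace.append(("pick", obj))
        move_off_table(scene)
        trace.append(("move_off_table", obj))
    return trace
```

Calling `clear_table(TableScene(on_table=["cup", "plate"]))` returns a trace of three skill invocations per object. In this sketch the composition is hand-written; the question the abstract poses is whether a model that has learned the individual skills can produce such a composition without being explicitly taught it.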

Code & Models

Repositories

Qualcomm-AI-research/ClevrSkills (PyTorch, official)

Videos

ClevrSkills: Compositional Language and Visual Reasoning in Robotics (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications