ClevrSkills: Compositional Language and Visual Reasoning in Robotics
Sanjay Haresh, Daniel Dijkman, Apratim Bhattacharyya, Roland Memisevic

TL;DR
ClevrSkills is a new benchmark suite designed to evaluate the ability of vision-language models to perform compositional reasoning in robotics tasks, highlighting current models' limitations in this area.
Contribution
The paper introduces ClevrSkills, an environment and dataset for testing compositional reasoning in robotics, and benchmarks existing models on it to show where they fall short.
Findings
VLMs struggle with compositional reasoning in robotics tasks.
Pre-training on large datasets does not ensure success in complex reasoning.
ClevrSkills provides a structured curriculum for evaluating reasoning capabilities.
Abstract
Robotics tasks are highly compositional by nature. For example, to perform a high-level task like cleaning the table, a robot must employ low-level capabilities: moving the effectors to the objects on the table, picking them up, and then moving them off the table one by one, while re-evaluating the consequently dynamic scenario in the process. Given that large vision language models (VLMs) have shown progress on many tasks that require high-level, human-like reasoning, we ask the question: if the models are taught the requisite low-level capabilities, can they compose them in novel ways to achieve interesting high-level tasks like cleaning the table without having to be explicitly taught to do so? To this end, we present ClevrSkills - a benchmark suite for compositional reasoning in robotics. ClevrSkills is an environment suite developed on top of the ManiSkill2 simulator and an accompanying…
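To make the idea of composition concrete, the table-cleaning example from the abstract can be sketched as a loop over three low-level skills. This is only a toy illustration in plain Python: the `TableWorld` class and the functions `move_to`, `pick`, `place_off_table`, and `clean_table` are hypothetical names invented here, and none of this is the ClevrSkills or ManiSkill2 API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TableWorld:
    """Toy scene state (hypothetical, not a ClevrSkills type)."""
    on_table: List[str]
    off_table: List[str] = field(default_factory=list)
    held: Optional[str] = None
    effector_at: Optional[str] = None


# --- Low-level skills (hypothetical names) ---

def move_to(world: TableWorld, obj: str) -> None:
    """Move the effector to the named object."""
    world.effector_at = obj


def pick(world: TableWorld) -> None:
    """Pick up the object the effector is currently at."""
    obj = world.effector_at
    if obj is None or obj not in world.on_table:
        raise ValueError("effector is not at an object on the table")
    world.on_table.remove(obj)
    world.held = obj


def place_off_table(world: TableWorld) -> None:
    """Put the held object down off the table."""
    if world.held is None:
        raise ValueError("nothing is being held")
    world.off_table.append(world.held)
    world.held = None
    world.effector_at = None


# --- High-level task composed from the skills above ---

def clean_table(world: TableWorld) -> int:
    """Clear the table one object at a time; return how many were moved."""
    moved = 0
    # The scene is re-checked on every iteration, since it changes
    # each time an object is removed.
    while world.on_table:
        target = world.on_table[0]
        move_to(world, target)
        pick(world)
        place_off_table(world)
        moved += 1
    return moved
```

Calling `clean_table(TableWorld(["cup", "plate", "spoon"]))` moves the three objects off the table in order. In this sketch the composition is written by hand; the question the abstract poses is whether a model taught only the low-level skills can arrive at such a composition without being shown it.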
Taxonomy
Topics: Multimodal Machine Learning Applications
