ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation

Enyu Zhao; Vedant Raval; Hejia Zhang; Jiageng Mao; Zeyu Shangguan; Stefanos Nikolaidis; Yue Wang; Daniel Seita

arXiv:2505.09698 · cs.RO · September 3, 2025

TL;DR

ManipBench introduces a comprehensive benchmark to evaluate vision-language models' low-level reasoning abilities in robotic manipulation, revealing significant performance variability and gaps compared to human understanding.

Contribution

The paper presents ManipBench, the first standardized benchmark for assessing VLMs' low-level manipulation reasoning in robotics, and evaluates 33 VLMs from 10 model families on it across multiple manipulation tasks.

Findings

01. VLM performance varies widely across tasks.

02. Benchmark scores correlate strongly with real-world task performance.

03. A significant gap remains between models and human-level understanding.

Abstract

Vision-Language Models (VLMs) have revolutionized artificial intelligence and robotics due to their commonsense reasoning capabilities. In robotic manipulation, VLMs are used primarily as high-level planners, but recent work has also studied their lower-level reasoning ability, which refers to making decisions about precise robot movements. However, the community currently lacks a clear and common benchmark that can evaluate how well VLMs can aid low-level reasoning in robotics. Consequently, we propose a novel benchmark, ManipBench, to evaluate the low-level robot manipulation reasoning capabilities of VLMs across various dimensions, including how well they understand object-object interactions and deformable object manipulation. We extensively test 33 representative VLMs across 10 model families on our benchmark, including variants to test different model sizes. Our evaluation shows…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robot Manipulation and Learning