ArtiBench and ArtiBrain: Benchmarking Generalizable Vision-Language Articulated Object Manipulation

Yuhan Wu; Tiantian Wei; Shuo Wang; ZhiChao Wang; Yanyong Zhang; Daniel Cremers; Yan Xia

arXiv:2511.20330 · cs.RO · December 1, 2025

TL;DR

This paper introduces ArtiBench, a comprehensive benchmark for evaluating generalizable vision-language manipulation of articulated objects, and proposes ArtiBrain, a modular framework that combines reasoning and adaptive control to improve manipulation robustness and generalization.

Contribution

We present ArtiBench, a new benchmark for structured evaluation of articulated object manipulation, and ArtiBrain, a novel framework integrating reasoning, control, and memory for enhanced generalization.

Findings

1. ArtiBrain outperforms existing methods in robustness and generalization on ArtiBench.

2. The benchmark reveals key challenges in cross-part and cross-instance manipulation.

3. The framework effectively combines high-level reasoning with low-level adaptive control.

Abstract

Interactive articulated manipulation requires long-horizon, multi-step interactions with appliances while maintaining physical consistency. Existing vision-language and diffusion-based policies struggle to generalize across parts, instances, and categories. We first introduce ArtiBench, a five-level benchmark covering kitchen, storage, office, and tool environments. ArtiBench enables structured evaluation from cross-part and cross-instance variation to long-horizon multi-object tasks, revealing the core generalization challenges of articulated object manipulation. Building on this benchmark, we propose ArtiBrain, a modular framework that unifies high-level reasoning with adaptive low-level control. ArtiBrain uses a VLM-based Task Reasoner (GPT-4.1) to decompose and validate subgoals, and employs a Hybrid Controller that combines geometry-aware keyframe execution with affordance-guided…
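
The abstract describes a Task Reasoner that decomposes a task into subgoals and validates them, with a controller carrying out the execution. A minimal sketch of such a decompose, execute, validate loop is given below. It is not ArtiBrain's implementation: every name (`Subgoal`, `run_task`, `toy_decompose`, `toy_execute`, `toy_validate`) is hypothetical, the retry policy is an assumption, and the toy functions merely stand in for the VLM reasoner and the controller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subgoal:
    """One step of a decomposed task, e.g. 'open the drawer' (hypothetical type)."""
    description: str
    done: bool = False


def run_task(
    instruction: str,
    decompose: Callable[[str], List[Subgoal]],
    execute: Callable[[Subgoal], None],
    validate: Callable[[Subgoal], bool],
    max_retries: int = 2,
) -> bool:
    """Decompose an instruction, then execute and validate each subgoal in order.

    Returns True only if every subgoal validates within its attempt budget.
    The retry-on-failed-validation policy is an assumption of this sketch.
    """
    for subgoal in decompose(instruction):
        for _attempt in range(max_retries + 1):
            execute(subgoal)
            if validate(subgoal):
                subgoal.done = True
                break
        if not subgoal.done:
            return False  # stop at the first subgoal that never validates
    return True


# Toy stand-ins for the reasoner and controller (assumptions, not real components).
log: List[str] = []


def toy_decompose(instruction: str) -> List[Subgoal]:
    # Split a sentence like "A, then B" into ordered subgoals.
    return [Subgoal(step.strip()) for step in instruction.split(", then ")]


def toy_execute(subgoal: Subgoal) -> None:
    # Record the attempt instead of moving a robot.
    log.append(subgoal.description)


def toy_validate(subgoal: Subgoal) -> bool:
    # Pretend any "drawer" step needs two attempts, to exercise the retry path.
    return "drawer" not in subgoal.description or log.count(subgoal.description) >= 2


ok = run_task(
    "open the drawer, then close the microwave",
    toy_decompose,
    toy_execute,
    toy_validate,
)
```

In this toy run the drawer step is attempted twice before it validates, after which the microwave step runs once. Passing the reasoner and controller in as callables keeps the loop independent of any particular model or robot stack.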

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Robotic Path Planning Algorithms