ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning
Zhao Jin, Zhengping Che, Tao Li, Zhen Zhao, Kun Wu, Yuheng Zhang, Yinuo Zhao, Zehui Liu, Qiang Zhang, Xiaozhu Ju, Jing Tian, Yousong Xue, Jian Tang

TL;DR
ArtVIP introduces a high-quality, open-source dataset of digitally realistic, physically accurate articulated objects with modular interactions, enhancing robot learning simulation and bridging the gap to real-world applications.
Contribution
It provides a comprehensive, realistic, and physically faithful dataset with embedded interaction behaviors and annotations, addressing limitations of existing datasets for robot training.
Findings
Demonstrated improved sim-to-real transfer in robot learning tasks.
Validated dataset's visual and physical fidelity through feature-map visualization and motion capture.
Enabled effective imitation and reinforcement learning experiments.
Abstract
Robot learning increasingly relies on simulation to advance complex abilities such as dexterous manipulation and precise interaction, necessitating high-quality digital assets to bridge the sim-to-real gap. However, existing open-source articulated-object datasets for simulation are limited by insufficient visual realism and low physical fidelity, which hinder their utility for training models to master robotic tasks in the real world. To address these challenges, we introduce ArtVIP, a comprehensive open-source dataset comprising high-quality digital-twin articulated objects, accompanied by indoor-scene assets. Crafted by professional 3D modelers adhering to unified standards, ArtVIP ensures visual realism through precise geometric meshes and high-resolution textures, while physical fidelity is achieved via fine-tuned dynamic parameters. Meanwhile, the dataset pioneers embedded modular…
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper makes a valuable and practical contribution by releasing a high-quality articulated object dataset that combines visual realism, physical accuracy, and modular interactions. The modeling pipeline and embedded behaviors are clearly documented, ensuring long-term usefulness for the robotics community.
2. The dataset’s open-source release in USD format, along with conversion tools (URDF/MJCF) and comprehensive production guidelines, greatly improves accessibility, reproducibility, and
1. The primary contribution lies in dataset engineering rather than methodological novelty. While the dataset’s quality is commendable, its scalability is constrained by manual modeling and tuning, which may limit extensibility. With the rise of generative pipelines such as RoboTwin [1] and Genesis [2], ArtVIP’s labor-intensive approach appears less sustainable for expansion.
2. The claimed physical fidelity mainly covers joint parameters such as damping, friction, and magnetic closure, but overl
In my opinion, below are the main strengths of the paper:
1. A high-quality 3D articulated-object dataset combined with scene-level information (e.g., kitchens), which exhibits greater photorealism and physical interactability.
2. Experiments are diverse and cover a breadth of tasks, such as evaluating photorealism, interactability, reconstruction performance, and feature distribution analysis, as well as downstream applications to imitation learning and RL.
3. The paper is nicely written
In my opinion, below are the main weaknesses of the paper:
1. While the qualitative results do show higher-quality assets, it's unclear how well these fare when compared to other low-effort feed-forward approaches [1,2,3,4]. A comparison to these feed-forward baselines for the experiments outlined in the paper would justify the time spent creating the higher-quality assets, given that low-effort approaches sometimes run at 1 Hz for eight or fewer objects from a single RGB-D image [2].
2. While qualit
- The overall paper is well written and motivated. It is clear what the goal of the paper is and why other approaches/datasets do not fulfill the requirements described in ArtVIP
- The dataset is crafted by experts following a specific assembly guideline, which should ensure that the objects are of high quality
- The dataset includes almost 1000 different assets which are articulated, as well as specific scenes and pixel-level annotations
- The evaluation of the dataset is thorough and includes
- Lines 187-189: Is there an explicit source or statistic that confirms this?
- Visualized Feature Distribution: It is not clear from Figure 5 (right) that ArtVIP object embeddings are actually that much closer to real-world object embeddings. It looks more like they are still apart and ArtVIP is more closely related to OmniGibson. I think some other form of measurement for the feature distribution would be necessary.
- Claim (1) in line 430: Can you provide experiments using simulated data from oth
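As an illustration of the kind of quantitative measurement the review above asks for, the following is a minimal sketch that is not taken from the paper or the review. It assumes each dataset is represented by a list of feature embeddings and compares two sets with the energy distance, a standard statistic that is zero for identical samples and grows as the distributions separate. The function names and the toy 2-D embeddings are hypothetical; real embeddings would come from whichever feature extractor the evaluation uses.

```python
import math
from itertools import product


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (x, y) with x in a, y in b."""
    total = sum(math.dist(x, y) for x, y in product(a, b))
    return total / (len(a) * len(b))


def energy_distance(a, b):
    """Energy distance between two embedding sets (V-statistic form).

    Computes 2*E|X-Y| - E|X-X'| - E|Y-Y'|, which is non-negative and
    equals zero when the two samples are identical.
    """
    return (
        2 * mean_pairwise_distance(a, b)
        - mean_pairwise_distance(a, a)
        - mean_pairwise_distance(b, b)
    )


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
sim_near = [(0.1, 0.0), (1.1, 0.0), (0.1, 1.0)]  # slightly shifted copy
sim_far = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]   # far-away cluster

d_near = energy_distance(real, sim_near)
d_far = energy_distance(real, sim_far)
print(f"near: {d_near:.4f}, far: {d_far:.4f}")
```

A single scalar per dataset pair makes a claim such as "closer to real-world embeddings" checkable without relying on a 2-D projection alone. Other statistics, such as a Fréchet distance or maximum mean discrepancy, would serve the same purpose.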
Taxonomy
Topics: Robot Manipulation and Learning · Robotics and Sensor-Based Localization · Social Robot Interaction and HRI
