Real2Code: Reconstruct Articulated Objects via Code Generation
Zhao Mandi, Yijia Weng, Dominik Bauer, Shuran Song

TL;DR
Real2Code introduces a scalable method that uses vision and language models to reconstruct complex articulated objects from images by generating code representing their joint structures, outperforming previous methods.
Contribution
The paper presents a novel approach that combines vision and language models to reconstruct articulated objects with up to 10 parts from images, including real-world multi-view captures, and surpasses the reconstruction accuracy of prior methods.
Findings
Outperforms previous state-of-the-art in reconstruction accuracy.
Capable of reconstructing objects with up to 10 articulated parts.
Generalizes from synthetic training data to real-world objects using only a small number of multi-view images.
Abstract
We present Real2Code, a novel approach to reconstructing articulated objects via code generation. Given visual observations of an object, we first reconstruct its part geometry using an image segmentation model and a shape completion model. We then represent the object parts with oriented bounding boxes, which are input to a fine-tuned large language model (LLM) to predict joint articulation as code. By leveraging pre-trained vision and language models, our approach scales elegantly with the number of articulated parts, and generalizes from synthetic training data to real world objects in unstructured environments. Experimental results demonstrate that Real2Code significantly outperforms previous state-of-the-art in reconstruction accuracy, and is the first approach to extrapolate beyond objects' structural complexity in the training set, and reconstructs objects with up to 10…
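The middle step of the abstract, where parts are summarized as oriented bounding boxes (OBBs) and handed to a language model, can be pictured with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the `OrientedBox` record, the `boxes_to_prompt` text format, and the idea of anchoring a hinge on a box edge via `hinge_from_edge` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class OrientedBox:
    """Hypothetical OBB record: center, half-extents, and three unit axes."""
    center: Vec3
    half_extents: Vec3
    axes: Tuple[Vec3, Vec3, Vec3]

    def edge_midpoint(self, axis_idx: int, sign_a: int, sign_b: int) -> Vec3:
        """Midpoint of one of the four box edges parallel to axes[axis_idx].

        sign_a and sign_b (+1 or -1) pick the side of the box along the
        two remaining axes, in increasing index order.
        """
        others = [i for i in range(3) if i != axis_idx]
        point = list(self.center)
        for other, sign in zip(others, (sign_a, sign_b)):
            for k in range(3):
                point[k] += sign * self.half_extents[other] * self.axes[other][k]
        return (point[0], point[1], point[2])


def boxes_to_prompt(boxes: List[OrientedBox]) -> str:
    """Serialize the boxes as plain text (an assumed format) for an LLM prompt."""
    return "\n".join(
        f"part_{i}: center={b.center}, half_extents={b.half_extents}, axes={b.axes}"
        for i, b in enumerate(boxes)
    )


def hinge_from_edge(box: OrientedBox, axis_idx: int, sign_a: int, sign_b: int) -> Dict:
    """Assumed joint encoding: a hinge whose axis is a box axis and whose
    position is the midpoint of a chosen box edge."""
    return {
        "type": "hinge",
        "axis": box.axes[axis_idx],
        "pos": box.edge_midpoint(axis_idx, sign_a, sign_b),
    }


if __name__ == "__main__":
    door = OrientedBox(
        center=(0.0, 0.0, 0.0),
        half_extents=(1.0, 2.0, 3.0),
        axes=((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)),
    )
    print(boxes_to_prompt([door]))
    print(hinge_from_edge(door, axis_idx=2, sign_a=1, sign_b=-1))
```

Under this assumed encoding, a joint is described by a few discrete choices relative to the box rather than by free-floating coordinates, which is one plausible reason a box-based input suits a text-generating model.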
Peer Reviews
Decision: ICLR 2025 Poster
In my opinion, below are the strengths of the paper: 1. The paper scales well to an increased number of joints. This has been a major limitation of preceding works, and this work addresses it nicely with a modular approach, i.e., part-level reconstruction and code-gen integration for the subsequent steps. 2. Strong quantitative improvements compared to recent state-of-the-art baselines, especially for an increasing number of parts. 3. The presentation of the paper is nice and paper writing is e
I have some questions for the authors. In my opinion, below are the paper's weaknesses: 1. Why does the PARIS baseline struggle so much, even in the 2-part case? Did the authors try to tune their method? Based on the PARIS results from the paper, it looks like it should work reasonably well in the simpler 2-part setting. 2. Despite good qualitative results, why are the results only shown on simpler objects like cupboards and laptops? Does the method work for varied articulated objects like scissors, s
1. This method formulates joint prediction as a code generation problem, which is different from prior work. The biggest advantage of such a formulation is the ability to scale well with different numbers of parts (prior work mostly handles objects with <=3 parts). 2. The overall pipeline is novel -- it leverages a few different modules, including vision models for part segmentation and completion as well as an LLM for code generation. This way, the problem is decomposed into a few smaller steps
1. In Sec. 4.2.1, it mentions that "we generate permutations of the set of predicted meshes and take the permutation that results in lowest error; the same logic is used for joint prediction results". I was wondering why this is needed to evaluate this method. Is it because the proposed method is not very stable? How much more time would this cost for the inference of this method? 2. The link to more visualizations included in Sec. 4.4 does not contain any result visualizations -- it seems it o
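The permutation-based evaluation quoted from Sec. 4.2.1 can be sketched as a brute-force search over part orderings. This is a minimal illustration under stated assumptions, not the authors' evaluation code: `best_permutation` and the `error_fn` callback are hypothetical names, and the mean per-part error is an assumed aggregate.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def best_permutation(
    predicted: Sequence[T],
    ground_truth: Sequence[T],
    error_fn: Callable[[T, T], float],
) -> Tuple[float, Tuple[int, ...]]:
    """Try every ordering of the predictions and keep the lowest-error one.

    Returns (mean error, order), where order[j] is the index of the
    prediction matched to ground_truth[j].
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")

    best_err = float("inf")
    best_order: Tuple[int, ...] = ()
    n = max(len(ground_truth), 1)  # avoid dividing by zero for empty inputs
    for order in permutations(range(len(predicted))):
        err = sum(error_fn(predicted[i], gt) for i, gt in zip(order, ground_truth)) / n
        if err < best_err:
            best_err, best_order = err, order
    return best_err, best_order


if __name__ == "__main__":
    # Toy stand-in: each "part" is a single number and the error is |p - g|.
    err, order = best_permutation(
        [5.0, 1.0, 3.0], [1.0, 3.0, 5.0], lambda p, g: abs(p - g)
    )
    print(err, order)
```

The search visits n! orderings, so 10 parts already means 3,628,800 candidates. The matching only aligns unordered predictions with ground-truth labels at evaluation time, so its cost does not by itself say anything about inference speed.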
1. The paper proposes a new pipeline Real2Code to reconstruct articulated shapes from images. It shows promising results on multiple categories with different joint types in the PartNet-Mobility dataset. 2. I find its generalization ability to multiple-joint shapes particularly interesting, which could potentially enable many real-world robot manipulation tasks. 3. The paper is overall easy to read.
1. The proposed pipeline consists of multiple components and, as a result, is rather fragile from what I understand, since a failure in any intermediate component can cause the entire pipeline to break down. For example, if the part bounding box parameters (from segmentation or shape completion) are inaccurate, the joint prediction step will inherit these errors. Since the whole procedure is open-loop, I wonder if the method still produces reasonable shapes assuming initial bounding box predictions are
Taxonomy
Topics: Handwritten Text Recognition Techniques · Image Processing and 3D Reconstruction · Natural Language Processing Techniques
