LEGO Co-builder: Exploring Fine-Grained Vision-Language Modeling for Multimodal LEGO Assembly Assistants

Haochen Huang; Jiahuan Pei; Mohammad Aliannejadi; Xin Sun; Moonisa Ahsan; Chuang Yu; Zhaochun Ren; Pablo Cesar; Junxiao Wang

arXiv:2507.05515 · cs.AI · July 24, 2025

TL;DR

This paper introduces LEGO Co-builder, a benchmark for evaluating vision-language models on fine-grained LEGO assembly tasks, revealing current models' limitations in detailed spatial and state understanding.

Contribution

It presents a new hybrid benchmark dataset and evaluates leading models, highlighting gaps in fine-grained multimodal understanding for assembly tasks.

Findings

01. GPT-4o achieves a maximum F1 score of 40.54% on state detection.

02. Leading models struggle with fine-grained spatial reasoning.

03. The benchmark and tools are publicly released for future research.

Abstract

Vision-language models (VLMs) face challenges in understanding and following multimodal assembly instructions, particularly when fine-grained spatial reasoning and precise object state detection are required. In this work, we explore LEGO Co-builder, a hybrid benchmark combining real-world LEGO assembly logic with programmatically generated multimodal scenes. The dataset captures stepwise visual states and procedural instructions, allowing controlled evaluation of instruction following, object detection, and state detection. We introduce a unified framework and assess leading VLMs such as GPT-4o, Gemini, and Qwen-VL under zero-shot and fine-tuned settings. Our results reveal that even advanced models like GPT-4o struggle with fine-grained assembly tasks, with a maximum F1 score of just 40.54% on state detection, highlighting gaps in fine-grained visual understanding. We…
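Since the headline result is reported as an F1 score, a minimal sketch of how a binary F1 is computed may help when reading that number. The function name, the boolean label encoding, and the toy data below are illustrative assumptions; the abstract does not say how the benchmark encodes states or whether scores are averaged per class.

```python
def f1_score(predicted, gold):
    """Binary F1 over two parallel lists of booleans (illustrative helper)."""
    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)        # predicted state, and it is present
    fp = sum(1 for p, g in pairs if p and not g)    # predicted state, but it is absent
    fn = sum(1 for p, g in pairs if not p and g)    # missed a state that is present
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, not data from the paper.
gold = [True, True, True, False, False, True]
predicted = [True, False, False, True, False, True]
score = f1_score(predicted, gold)
print(f"F1 = {score:.4f}")
```

F1 is the harmonic mean of precision and recall, so a low score such as the reported 40.54% means that at least one of the two is low.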

Taxonomy

Topics: Multimodal Machine Learning Applications