Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

Zhuchenyang Liu; Yao Zhang; Yu Xiao

arXiv:2604.00913·cs.CV·April 2, 2026

TL;DR

This paper evaluates 19 vision-language models (2B-38B) on IKEA-Bench, a new benchmark for aligning 2D assembly diagrams with video of the same steps, and finds that text helps instruction understanding while degrading diagram-to-video alignment.

Contribution

It introduces IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products for cross-depiction assembly instruction understanding, and provides a mechanistic analysis of model behaviors and limitations.

Findings

01. Text improves instruction understanding but hinders diagram-video alignment.

02. Model architecture influences alignment accuracy more than size.

03. Video understanding remains a significant bottleneck.

Abstract

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly…
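Because the benchmark spans several task types, results on it are naturally reported per task rather than as a single number. The following is a minimal sketch of that aggregation, not the authors' evaluation code: the record keys (`task`, `answer`, `prediction`) and the task names `task_a` and `task_b` are hypothetical placeholders, and the actual IKEA-Bench schema may differ.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return {task: accuracy} for an iterable of question records.

    Each record is assumed (hypothetically) to be a dict with keys
    'task', 'answer', and 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if record["prediction"] == record["answer"]:
            correct[record["task"]] += 1
    # Every task in `total` has at least one record, so no division by zero.
    return {task: correct[task] / total[task] for task in total}


# Toy records with placeholder task names and answers.
records = [
    {"task": "task_a", "answer": "B", "prediction": "B"},
    {"task": "task_a", "answer": "C", "prediction": "A"},
    {"task": "task_b", "answer": "D", "prediction": "D"},
]

scores = accuracy_by_task(records)
```

Keeping the scores separated by task makes it possible to see a model improve on one task type while regressing on another, which a pooled accuracy would hide.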

Code & Models

Repositories

Datasets

Ryenhails/ikea-bench
dataset · 355 downloads
