FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Xiangru Jian; Hao Xu; Wei Pang; Xinjian Zhao; Chengyu Tao; Qixin Zhang; Xikun Zhang; Chao Zhang; Guanzhi Deng; Alex Xue; Juan Du; Tianshu Yu; Garth Tarr; Linqi Song; Qiuzhuang Sun; Dacheng Tao

arXiv:2604.07413 · cs.CV · April 14, 2026

TL;DR

This paper introduces FORGE, a multimodal dataset and evaluation framework for manufacturing scenarios, and finds that domain-specific knowledge, more than visual grounding, limits model performance.

Contribution

The authors created a high-quality multimodal dataset with fine-grained domain annotations and demonstrated its usefulness for evaluating and improving manufacturing MLLMs.

Findings

1. Supervised fine-tuning on FORGE data improves accuracy by up to 90.8%.

2. Visual grounding is less of a bottleneck than domain knowledge in manufacturing tasks.

3. Evaluation reveals significant performance gaps in current state-of-the-art MLLMs.

Abstract

The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to transition from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, the bottleneck analysis shows that visual grounding is not the primary…
