IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

Songlin Bai; Xintong Wang; Linlin Yu; Bin Chen; Zhiang Xu; Yuyang Sheng; Changtong Zan; Xiaofeng Zhu; Yizhe Zhang; Jiru Li; Mingze Guo; Ling Zou; Yalong Li; Chengfu Huo; Liang Ding

arXiv:2605.10267 · cs.AI · May 14, 2026

TL;DR

IndustryBench is a 2,049-item benchmark for industrial procurement QA, written in Chinese with item-aligned English, Russian, and Vietnamese renderings, that shows current LLMs remain unreliable in safety-critical industrial contexts.

Contribution

The paper introduces IndustryBench, a benchmark grounded in Chinese national standards (GB/T) and structured industrial product records, together with an evaluation pipeline that reveals significant safety and correctness limitations of existing LLMs in industrial QA.

Findings

01. The best model scores only 2.083 out of 3, indicating substantial room for improvement.

02. Standards & Terminology is the most persistent weakness across models.

03. Safety-violation rates significantly affect model rankings, emphasizing the need for safety-aware evaluation.
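
The third finding can be illustrated with a toy calculation. This summary does not give the paper's actual rubric or safety metric, so everything below is an assumption made for illustration: the per-item integer scores on a 0 to 3 scale, the boolean violation flags, the model names, the data, and the `safety_adjusted` penalty formula are all hypothetical. The sketch only shows how a ranking by mean score can flip once safety violations are taken into account.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    scores: list[int]       # hypothetical per-item scores on a 0-3 scale
    violations: list[bool]  # hypothetical flag: answer contradicts a safety clause


def mean_score(r: ModelResult) -> float:
    """Average per-item score (0 to 3)."""
    return sum(r.scores) / len(r.scores)


def violation_rate(r: ModelResult) -> float:
    """Fraction of items flagged as safety violations."""
    return sum(r.violations) / len(r.violations)


def safety_adjusted(r: ModelResult, penalty: float = 3.0) -> float:
    """Assumed adjustment, not the paper's metric: mean minus penalty * rate."""
    return mean_score(r) - penalty * violation_rate(r)


def rank(results: list[ModelResult], key) -> list[str]:
    """Model names ordered from best to worst under the given key."""
    return [r.name for r in sorted(results, key=key, reverse=True)]


# Made-up data: model_a scores higher on average but violates safety more often.
results = [
    ModelResult("model_a", [3, 3, 2, 3], [False, True, True, False]),
    ModelResult("model_b", [2, 3, 2, 2], [False, False, False, False]),
]

print(rank(results, mean_score))       # ranking by raw mean score
print(rank(results, safety_adjusted))  # ranking after the assumed safety penalty
```

With this made-up data, `model_a` leads on raw mean score (2.75 versus 2.25) but falls behind `model_b` once half of its answers are penalized as safety violations, which is the kind of reordering a safety-aware evaluation is meant to surface.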

Abstract

In industrial procurement, an LLM answer is useful only if it survives a standards check: the recommended material must match the operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our…

Code & Models

Repositories

alibaba-multimodal-industrial-ai/IndustryBench
github

Datasets

alibaba-multimodal-industrial-ai/IndustryBench
dataset · 234 downloads
