From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge

Muhammad Tayyab Khan; Lequn Chen; Zane Yong; Jun Ming Tan; Wenhe Feng; Seung Ki Moon

arXiv:2506.17374·cs.CV·September 30, 2025

TL;DR

This paper introduces a hybrid vision-language framework combining rotation-aware object detection and transformer-based parsing to extract structured manufacturing information from complex 2D engineering drawings, improving automation and accuracy.

Contribution

The paper presents a novel pipeline integrating YOLOv11-OBB with fine-tuned vision-language models for parsing detailed annotations in engineering drawings, addressing limitations of traditional OCR methods.

Findings

01

Donut achieves 88.5% precision and a 93.5% F1-score on annotation parsing.

02

The framework effectively localizes and extracts key annotation information from complex drawings.

03

The approach supports downstream manufacturing tasks, demonstrating practical industrial utility.
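
Finding 01 reports precision and F1 but not recall. Because F1 is the harmonic mean of precision and recall, the recall implied by those two figures can be backed out. The sketch below does that arithmetic; the function name `implied_recall` is hypothetical, and because the reported percentages are rounded, the result is only an approximation of the underlying value.

```python
def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1 as fractions in (0, 1]."""
    denom = 2 * precision - f1
    if denom <= 0:
        # No recall in (0, 1] is consistent with these inputs.
        raise ValueError("precision and F1 are mutually inconsistent")
    return f1 * precision / denom


# Figures from finding 01 (rounded in the source, so the result is approximate).
recall = implied_recall(0.885, 0.935)
print(f"implied recall ~ {recall:.3f}")  # prints "implied recall ~ 0.991"
```

A recall this close to 1 alongside lower precision would suggest that most errors are spurious or incorrect extractions rather than missed ones, though that reading rests only on the two rounded numbers above.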

Abstract

Efficient and accurate extraction of key information from 2D engineering drawings is essential for advancing digital manufacturing workflows. Such information includes geometric dimensioning and tolerancing (GD&T), measures, material specifications, and textual annotations. Manual extraction is slow and labor-intensive, while generic OCR models often fail due to complex layouts, engineering symbols, and rotated text, leading to incomplete and unreliable outputs. To address these challenges, we propose a hybrid vision-language framework that integrates a rotation-aware object detection model (YOLOv11-OBB) with a transformer-based vision-language parser. Our structured pipeline applies YOLOv11-OBB to localize annotations and extract oriented bounding box (OBB) patches, which are then parsed into structured outputs using a…
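
The OBB patch extraction step described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `obb_corners` and `crop_obb` are hypothetical, the box parameterization (center, width, height, rotation in radians) is an assumption about how a detector might report an oriented box, and nearest-neighbor sampling stands in for whatever interpolation a real pipeline would use.

```python
import numpy as np


def obb_corners(cx, cy, w, h, theta):
    """Return the four (x, y) corners of a w-by-h box centred at (cx, cy), rotated by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    # Corners of the unrotated box, relative to its centre.
    half = np.array([[-w / 2, -h / 2], [w / 2, -h / 2], [w / 2, h / 2], [-w / 2, h / 2]])
    return half @ rot.T + np.array([cx, cy])


def crop_obb(image, cx, cy, w, h, theta):
    """Resample the rotated box into an upright (h, w) patch using nearest-neighbour lookup."""
    out_h, out_w = int(round(h)), int(round(w))
    # Patch-local pixel coordinates, centred on the box centre.
    xs = np.arange(out_w) - (out_w - 1) / 2
    ys = np.arange(out_h) - (out_h - 1) / 2
    gx, gy = np.meshgrid(xs, ys)
    c, s = np.cos(theta), np.sin(theta)
    # Rotate patch coordinates into image coordinates.
    src_x = cx + c * gx - s * gy
    src_y = cy + s * gx + c * gy
    # Clamp to the image bounds so boxes near the border do not index out of range.
    xi = np.clip(np.rint(src_x).astype(int), 0, image.shape[1] - 1)
    yi = np.clip(np.rint(src_y).astype(int), 0, image.shape[0] - 1)
    return image[yi, xi]


# Toy example: a 10x10 "drawing" and one rotated annotation box.
drawing = np.arange(100).reshape(10, 10)
patch = crop_obb(drawing, cx=5, cy=5, w=4, h=2, theta=np.pi / 6)
print(patch.shape)  # (2, 4): the patch is upright regardless of the box rotation
```

Resampling into an upright patch is what lets a downstream parser see rotated text in a canonical orientation, which is the motivation the abstract gives for using a rotation-aware detector.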
