From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge
Muhammad Tayyab Khan, Lequn Chen, Zane Yong, Jun Ming Tan, Wenhe Feng, Seung Ki Moon

TL;DR
This paper introduces a hybrid vision-language framework combining rotation-aware object detection and transformer-based parsing to extract structured manufacturing information from complex 2D engineering drawings, improving automation and accuracy.
Contribution
The paper presents a novel pipeline integrating YOLOv11-OBB with fine-tuned vision-language models for parsing detailed annotations in engineering drawings, addressing limitations of traditional OCR methods.
Findings
Donut achieves 88.5% precision and a 93.5% F1-score on the annotation parsing task.
The framework effectively localizes and extracts key annotation information from complex drawings.
The approach supports downstream manufacturing tasks, demonstrating practical industrial utility.
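The precision and F1-score reported for Donut also pin down the recall, since F1 is the harmonic mean of the two. The worked step below is only arithmetic on the two rounded figures given here. The resulting recall is an implied, approximate value, not a number taken from the paper.

```latex
F_1 = \frac{2PR}{P + R}
\quad\Longrightarrow\quad
R = \frac{F_1 \, P}{2P - F_1}
  = \frac{0.935 \times 0.885}{2 \times 0.885 - 0.935}
  = \frac{0.827475}{0.835}
  \approx 0.991
```

A recall near 99% alongside 88.5% precision suggests the parser's errors are mostly spurious or incorrect extractions rather than missed ones.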
Abstract
Efficient and accurate extraction of key information from 2D engineering drawings is essential for advancing digital manufacturing workflows. Such information includes geometric dimensioning and tolerancing (GD&T), measures, material specifications, and textual annotations. Manual extraction is slow and labor-intensive, while generic OCR models often fail due to complex layouts, engineering symbols, and rotated text, leading to incomplete and unreliable outputs. To address these challenges, we propose a hybrid vision-language framework that integrates a rotation-aware object detection model (YOLOv11-OBB) with a transformer-based vision-language parser. Our structured pipeline applies YOLOv11-OBB to localize annotations and extract oriented bounding box (OBB) patches, which are then parsed into structured outputs using a…
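To make the patch-extraction step concrete, here is a minimal sketch of how an oriented bounding box can be turned into an upright patch before it is handed to a parser. It is not the authors' implementation and does not use the YOLOv11-OBB API. The box parametrization (center, width, height, angle in radians measured from the x-axis toward the y-axis), the function names `obb_corners` and `crop_obb`, and the nearest-neighbour sampling are all assumptions chosen for illustration.

```python
import math


def obb_corners(cx, cy, w, h, angle):
    """Return the four corners of an oriented box.

    Assumed (hypothetical) parametrization: center (cx, cy), size (w, h),
    angle in radians measured from the x-axis toward the y-axis.
    """
    c, s = math.cos(angle), math.sin(angle)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in half]


def crop_obb(image, cx, cy, w, h, angle, fill=0):
    """Sample an upright patch from a rotated box in a 2D list-of-lists image.

    Each output pixel center is mapped back into the source image by rotating
    its offset from the box center; the nearest source pixel is copied.
    Locations that fall outside the image are set to `fill`.
    """
    c, s = math.cos(angle), math.sin(angle)
    rows, cols = len(image), len(image[0])
    out_w, out_h = int(round(w)), int(round(h))
    patch = []
    for j in range(out_h):
        row = []
        for i in range(out_w):
            # Offset of this output pixel's center from the box center.
            dx = i + 0.5 - w / 2
            dy = j + 0.5 - h / 2
            # Rotate the offset into source-image coordinates.
            x = cx + dx * c - dy * s
            y = cy + dx * s + dy * c
            xi, yi = math.floor(x), math.floor(y)
            inside = 0 <= xi < cols and 0 <= yi < rows
            row.append(image[yi][xi] if inside else fill)
        patch.append(row)
    return patch


# Toy 4x4 "image" whose pixel values encode their own position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
upright = crop_obb(image, 2, 2, 2, 2, 0.0)
rotated = crop_obb(image, 2, 2, 2, 2, math.pi / 2)
```

In a real pipeline the box parameters would come from the detector's output and the sampling would normally use an interpolating warp from an image library. The point of the sketch is only that rotated text is normalized to an upright patch so the downstream parser sees it in reading orientation.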
