A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model

Muhammad Tayyab Khan; Zane Yong; Lequn Chen; Wenhe Feng; Nicholas Yew Jin Tan; Seung Ki Moon

arXiv:2510.21862 · cs.CV · January 26, 2026

TL;DR

This paper introduces a three-stage hybrid framework that combines modern object detectors with vision language models (VLMs) to automate the interpretation of complex multi-view engineering drawings, with the aim of improving accuracy and scalability.

Contribution

The paper presents a multi-stage approach for engineering drawings that combines layout detection, orientation-aware annotation detection, and OCR-free semantic parsing with vision language models.

Findings

01. The alphabetical VLM achieved an F1 score of 0.672.

02. The numerical VLM achieved an F1 score of 0.963.

03. The framework enables scalable, automated interpretation of engineering drawings.
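
For readers unfamiliar with the metric behind the two scores in the findings, the following is a minimal sketch of the standard F1 computation from precision and recall. The function name `f1_score` and the counts in the usage line are hypothetical, and this summary does not state how the paper matches predicted fields to ground truth, so the sketch illustrates the definition only, not the authors' evaluation procedure.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1 = 2PR / (P + R), or 0.0 when precision or recall is undefined."""
    predicted = true_positives + false_positives  # everything the model output
    actual = true_positives + false_negatives     # everything in the ground truth
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from the paper: 90 correct fields,
# 10 spurious fields, 10 missed fields.
example = f1_score(true_positives=90, false_positives=10, false_negatives=10)
```

With these illustrative counts, precision and recall are both 0.9, so the F1 score is also approximately 0.9. Because F1 is the harmonic mean of the two, a low value such as 0.672 can result from weak precision, weak recall, or both.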

Abstract

Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging using manual methods, generic optical character recognition (OCR) systems, or traditional deep learning approaches, due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures,…
