AECV-Bench: Benchmarking Multimodal Models on Architectural and Engineering Drawings Understanding

Aleksei Kondratenko; Mussie Birhane; Houssame E. Hsain; Guido Maciocci

arXiv:2601.04819 · cs.AI · January 9, 2026

Open Access

TL;DR

This paper introduces AECV-Bench, a benchmark for evaluating multimodal and vision-language models on architectural and engineering drawings. Results show strong text extraction (OCR) but weaker spatial reasoning and poor symbol-centric counting.

Contribution

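The paper presents AECV-Bench, a new benchmark for assessing multimodal models on AEC drawings through two use cases: object counting on 120 floor plans and drawing-grounded document QA over 192 question-answer pairs. The results expose current model limitations, most clearly in symbol-centric counting.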

Findings

01. High accuracy in OCR and text-based QA (up to 0.95)

02. Moderate performance in spatial reasoning tasks

03. Poor performance in symbol-centric counting (often 0.40-0.55 accuracy)

Abstract

AEC drawings encode geometry and semantics through symbols, layout conventions, and dense annotation, yet it remains unclear whether modern multimodal and vision-language models can reliably interpret this graphical language. We present AECV-Bench, a benchmark for evaluating multimodal and vision-language models on realistic AEC artefacts via two complementary use cases: (i) object counting on 120 high-quality floor plans (doors, windows, bedrooms, toilets), and (ii) drawing-grounded document QA spanning 192 question-answer pairs that test text extraction (OCR), instance counting, spatial reasoning, and comparative reasoning over common drawing regions. Object-counting performance is reported using per-field exact-match accuracy and MAPE results, while document-QA performance is reported using overall accuracy and per-category breakdowns with an LLM-as-a-judge scoring pipeline and…
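The two object-counting metrics named in the abstract, per-field exact-match accuracy and MAPE (mean absolute percentage error), can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `per_field_metrics`, the dictionary layout of predictions and ground truth, and the choice to skip zero-count ground truths when computing MAPE are all assumptions made for illustration. Only the four field names come from the abstract.

```python
from typing import Dict, List

# The four counted object types listed in the abstract.
FIELDS = ["doors", "windows", "bedrooms", "toilets"]


def per_field_metrics(
    preds: List[Dict[str, int]],
    truths: List[Dict[str, int]],
) -> Dict[str, Dict[str, float]]:
    """Compute exact-match accuracy and MAPE (in percent) for each field.

    Hypothetical data layout: one dict of counts per floor plan, with
    `preds[i]` and `truths[i]` referring to the same drawing.
    """
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and equal in length")

    metrics: Dict[str, Dict[str, float]] = {}
    for field in FIELDS:
        pairs = [(p[field], t[field]) for p, t in zip(preds, truths)]

        # Exact match: the predicted count equals the ground-truth count.
        exact = sum(1 for p, t in pairs if p == t) / len(pairs)

        # MAPE is undefined when the ground truth is 0, so those drawings
        # are skipped here (an assumption, not a rule taken from the paper).
        nonzero = [(p, t) for p, t in pairs if t != 0]
        if nonzero:
            mape = 100.0 * sum(abs(p - t) / t for p, t in nonzero) / len(nonzero)
        else:
            mape = float("nan")

        metrics[field] = {"exact_match": exact, "mape": mape}
    return metrics


if __name__ == "__main__":
    # Toy counts for two made-up floor plans, not benchmark data.
    predictions = [
        {"doors": 5, "windows": 8, "bedrooms": 2, "toilets": 1},
        {"doors": 4, "windows": 6, "bedrooms": 3, "toilets": 2},
    ]
    ground_truth = [
        {"doors": 5, "windows": 10, "bedrooms": 2, "toilets": 1},
        {"doors": 5, "windows": 6, "bedrooms": 3, "toilets": 2},
    ]
    for field, scores in per_field_metrics(predictions, ground_truth).items():
        print(field, scores)
```

Reporting both metrics is useful because they answer different questions: exact match treats a miscount of one door the same as a miscount of ten, while MAPE shows how far off the counts are on average.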

Taxonomy

Topics: Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications · Data Visualization and Analytics