Architecture inside the mirage: evaluating generative image models on architectural style, elements, and typologies

Jamie Magrill (1), Leah Gornstein (1), Sandra Seekins (2), Barry Magrill (2) ((1) McGill University, Montreal, Canada, (2) Capilano University, North Vancouver, Canada)

arXiv:2601.09169·cs.CV·January 15, 2026

Open Access

TL;DR

This study evaluates how accurately five major generative AI image platforms render architectural styles, elements, and typologies, finding limited overall accuracy and recurring error patterns.

Contribution

It provides a systematic assessment of GenAI's ability to generate accurate architectural images and highlights specific limitations and recurring errors in current models.

Findings

1. Overall accuracy ranged from 32% to 52%.

2. Common errors include over-embellishment and style confusion.

3. Images from Common prompts were 2.7-fold more accurate than those from Rare prompts (p < 0.05).

Abstract

Generative artificial intelligence (GenAI) text-to-image systems are increasingly used to generate architectural imagery, yet their capacity to reproduce accurate images in a historically rule-bound field remains poorly characterized. We evaluated five widely used GenAI image platforms (Adobe Firefly, DALL-E 3, Google Imagen 3, Microsoft Image Generator, and Midjourney) using 30 architectural prompts spanning styles, typologies, and codified elements. Each prompt-generator pair produced four images (n = 600 images total). Two architectural historians independently scored each image for accuracy against predefined criteria, resolving disagreements by consensus. Set-level performance was summarized as zero to four accurate images per four-image set. Image output from Common prompts was 2.7-fold more accurate than from Rare prompts (p < 0.05). Across platforms, overall accuracy was limited…
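
The study design described in the abstract can be checked with a little arithmetic. The following is a minimal sketch, not the authors' code: the function names and the example verdicts are hypothetical, and only the platform list, the 30 prompts, and the four images per set come from the abstract.

```python
from itertools import product

# Platforms named in the abstract.
PLATFORMS = [
    "Adobe Firefly",
    "DALL-E 3",
    "Google Imagen 3",
    "Microsoft Image Generator",
    "Midjourney",
]
N_PROMPTS = 30       # prompts spanning styles, typologies, and elements
IMAGES_PER_SET = 4   # images produced per prompt-generator pair


def enumerate_sets():
    """Return every (platform, prompt_index) pair in the design."""
    return list(product(PLATFORMS, range(N_PROMPTS)))


def set_score(image_verdicts):
    """Count the accurate images in one four-image set (0 to 4).

    `image_verdicts` is a hypothetical list of booleans, one per image,
    standing in for the consensus judgment of the two reviewers.
    """
    if len(image_verdicts) != IMAGES_PER_SET:
        raise ValueError("expected exactly four verdicts per set")
    return sum(bool(v) for v in image_verdicts)


sets = enumerate_sets()
total_images = len(sets) * IMAGES_PER_SET  # 5 platforms * 30 prompts * 4 images
```

With this layout, `len(sets)` is 150 prompt-generator pairs and `total_images` is 600, matching the n reported in the abstract. The verdicts passed to `set_score` are placeholders for illustration only and do not reproduce any result from the study.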

Taxonomy

Topics: Aesthetic Perception and Analysis · Image Processing and 3D Reconstruction · 3D Surveying and Cultural Heritage