AEC-Bench: A Multimodal Benchmark for Agentic Systems in Architecture, Engineering, and Construction
Harsh Mankodiya, Chase Gallik, Theodoros Galanos, Andriy Mulyar

TL;DR
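AEC-Bench is a multimodal benchmark for evaluating agentic systems on real-world architecture, engineering, and construction (AEC) tasks. The authors use it to identify tools and harness design techniques that consistently improve performance across foundation models, and they openly release the dataset, agent harness, and evaluation code.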
Contribution
The paper introduces a benchmark dataset, an evaluation protocol, and baseline results for assessing foundation-model agent harnesses on AEC-specific tasks, with code and data openly available.
Findings
Certain tools and harness design techniques consistently improve performance across foundation models in their own base harnesses, such as Claude Code and Codex.
The benchmark covers tasks requiring drawing understanding, cross-sheet reasoning, and construction project-level coordination.
Open release of dataset, code, and agent harness facilitates reproducibility and further research.
Abstract
AEC-Bench is a multimodal benchmark for evaluating agentic systems on real-world tasks in the Architecture, Engineering, and Construction (AEC) domain. The benchmark covers tasks requiring drawing understanding, cross-sheet reasoning, and construction project-level coordination. This report describes the benchmark motivation, dataset taxonomy, evaluation protocol, and baseline results across several domain-specific foundation model harnesses. We use AEC-Bench to identify tools and harness design techniques that consistently improve performance across foundation models in their own base harnesses, such as Claude Code and Codex. We openly release our benchmark dataset, agent harness, and evaluation code for full replicability at https://github.com/nomic-ai/aec-bench under an Apache 2 license.
