BEMEval-Doc2Schema: Benchmarking Large Language Models for Structured Data Extraction in Building Energy Modeling
Yiyuan Jia, Xiaoqin Fu, Liang Zhang

TL;DR
This paper introduces BEMEval-Doc2Schema, the first benchmark in the BEMEval framework. It evaluates how well large language models extract structured data from building documentation, a foundational step toward automated building energy modeling.
Contribution
It presents the first standardized benchmark for assessing LLM performance on building energy modeling tasks, together with a new metric, the Key-Value Overlap Rate (KVOR), enabling systematic comparison of models.
Findings
Gemini 2.5 outperforms GPT-5 in structured data extraction.
Few-shot prompting improves model accuracy.
Simpler schemas yield higher KVOR scores.
Abstract
Recent advances in foundation models, including large language models (LLMs), have created new opportunities to automate building energy modeling (BEM). However, systematic evaluation has remained challenging due to the absence of publicly available, task-specific datasets and standardized performance metrics. We present BEMEval, a benchmark framework designed to assess foundation models' performance across BEM tasks. The first benchmark in this suite, BEMEval-Doc2Schema, focuses on structured data extraction from building documentation, a foundational step toward automated BEM processes. BEMEval-Doc2Schema introduces the Key-Value Overlap Rate (KVOR), a metric that quantifies the alignment between LLM-generated structured outputs and ground-truth schema references. Using this framework, we evaluate two leading models (GPT-5 and Gemini 2.5) under zero-shot and few-shot prompting…
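The abstract describes KVOR only as a measure of alignment between a model's structured output and a ground-truth schema reference; the exact formula is not given here. As a rough illustration of what a key-value overlap score could look like, the sketch below flattens nested JSON-like objects into (key path, value) pairs and reports the fraction of reference pairs that the prediction reproduces exactly. The function names `flatten` and `key_value_overlap`, the exact-match rule, and the example building fields are all assumptions made for illustration, not the paper's definition.

```python
from typing import Any, Dict, Tuple


def flatten(obj: Any, prefix: Tuple = ()) -> Dict[Tuple, Any]:
    """Flatten nested dicts/lists into a {key_path: leaf_value} mapping."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        # Leaf value: record it under the path accumulated so far.
        return {prefix: obj}
    flat: Dict[Tuple, Any] = {}
    for key, value in items:
        flat.update(flatten(value, prefix + (key,)))
    return flat


def key_value_overlap(predicted: Any, reference: Any) -> float:
    """Fraction of reference (path, value) pairs reproduced exactly.

    Hypothetical scoring rule: the paper's KVOR may normalize or
    match values differently.
    """
    pred = flatten(predicted)
    ref = flatten(reference)
    if not ref:
        return 1.0 if not pred else 0.0
    matched = sum(
        1 for path, value in ref.items()
        if path in pred and pred[path] == value
    )
    return matched / len(ref)


# Made-up example schema fields, for illustration only.
reference = {"zone": {"name": "Office", "area_m2": 120.0}, "hvac": {"type": "VAV"}}
predicted = {"zone": {"name": "Office", "area_m2": 100.0}, "hvac": {"type": "VAV"}}

score = key_value_overlap(predicted, reference)
print(f"overlap = {score:.3f}")
```

In this toy case two of the three reference pairs match, so the score is 2/3. A score of this shape penalizes missing or wrong values but ignores extra keys in the prediction; whether the actual metric does the same is not stated in the abstract.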
Taxonomy
Topics: BIM and Construction Integration · Building Energy and Comfort Optimization · Energy Load and Power Forecasting
