Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy

Si Chen; Le Huy Khiem; Annalisa Szymanski; Ronald Metoyer; Ting Hua; Nitesh V. Chawla

arXiv:2601.20253 · cs.CL · January 29, 2026

Open Access

TL;DR

This paper presents a framework for automatically generating benchmarks from domain guidelines, using Bloom's Taxonomy to evaluate large language models' reasoning across real-world, practice-based domains.

Contribution

It introduces a scalable, reproducible method to create detailed, domain-specific benchmarks from expert guidelines, enabling better assessment of reasoning in practical settings.

Findings

01. LLMs perform better on higher-order reasoning (Analyze)

02. LLMs struggle with lower-level items (Remember)

03. Generated benchmarks reveal non-intuitive model behaviors

Abstract

Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applied to three domains (teaching, dietetics, and caregiving), we find differences between model and human-like reasoning: LLMs sometimes…
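The auto-graded MCQ evaluation described in the abstract can be illustrated with a minimal sketch, assuming each generated item carries a Bloom level label and a single gold option. The names MCQItem and accuracy_by_level, the item IDs, and the toy predictions are hypothetical and do not come from the paper; only the level labels "Remember" and "Analyze" appear in the source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQItem:
    """Hypothetical record for one generated multiple-choice question."""
    item_id: str
    level: str   # Bloom level label, e.g. "Remember" or "Analyze"
    answer: str  # key of the gold option, e.g. "B"


def accuracy_by_level(items, predictions):
    """Grade predictions by exact match on the option key, grouped by level.

    An item with no prediction counts as incorrect, so the same inputs
    always yield the same scores.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.level] += 1
        if predictions.get(item.item_id) == item.answer:
            correct[item.level] += 1
    return {level: correct[level] / total[level] for level in total}


# Toy data for illustration only; not results from the paper.
items = [
    MCQItem("q1", "Remember", "A"),
    MCQItem("q2", "Remember", "C"),
    MCQItem("q3", "Analyze", "B"),
    MCQItem("q4", "Analyze", "D"),
]
predictions = {"q1": "A", "q2": "B", "q3": "B", "q4": "D"}

scores = accuracy_by_level(items, predictions)
print(scores)
```

Grading by exact match on an option key is what makes this kind of scoring deterministic: no judge model or free-text comparison is involved, and per-level accuracies can be compared directly across models.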

Taxonomy

Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Multimodal Machine Learning Applications