Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy
Si Chen, Le Huy Khiem, Annalisa Szymanski, Ronald Metoyer, Ting Hua, Nitesh V. Chawla

TL;DR
This paper presents a framework for automatically generating benchmarks from domain guidelines, using Bloom's Taxonomy to evaluate large language models' reasoning across real-world, practice-based domains.
Contribution
It introduces a scalable, reproducible method to create detailed, domain-specific benchmarks from expert guidelines, enabling better assessment of reasoning in practical settings.
Findings
LLMs sometimes perform better on higher-order reasoning (Analyze)
LLMs struggle with lower-level items (Remember)
Generated benchmarks reveal non-intuitive model behaviors
Abstract
Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applying it to three practice-based domains (teaching, dietetics, and caregiving), we find differences between model and human-like reasoning: LLMs sometimes…
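The auto-graded MCQ evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `MCQItem` schema, the `accuracy_by_level` helper, and the placeholder items are all hypothetical, and only the two level names that appear on this page (Remember, Analyze) are used. The sketch shows why grading is deterministic: each item has a fixed answer key, so a score per cognitive level is an exact-match count.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQItem:
    """One auto-graded multiple-choice item (hypothetical schema, not from the paper)."""
    question: str
    options: tuple
    answer_index: int  # index of the correct option in `options`
    level: str         # Bloom level label, e.g. "Remember" or "Analyze"


def accuracy_by_level(items, predictions):
    """Return {level: fraction correct}; predictions[i] is the option index chosen for items[i]."""
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per item")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.level] += 1
        # Exact match against the answer key: no human or model judge involved.
        correct[item.level] += int(pred == item.answer_index)
    return {level: correct[level] / total[level] for level in total}


# Placeholder items and predictions, for illustration only (not data from the paper).
items = [
    MCQItem("placeholder question 1", ("A", "B"), 0, "Remember"),
    MCQItem("placeholder question 2", ("A", "B"), 1, "Remember"),
    MCQItem("placeholder question 3", ("A", "B"), 1, "Analyze"),
]
scores = accuracy_by_level(items, [0, 0, 1])
print(scores)
```

Because scoring is a pure function of the answer key and the chosen option, rerunning it on the same predictions always gives the same per-level accuracies, which is the sense in which such an evaluation is reproducible.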
Taxonomy
Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Multimodal Machine Learning Applications
