From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız

TL;DR
This paper introduces an automated, unbiased pipeline for creating domain-specific benchmarks from raw corpora to evaluate LLMs' domain knowledge without relying on other LLMs or human annotation.
Contribution
The authors present a novel deterministic method to generate domain benchmarks directly from raw data, enabling scalable, fair, and up-to-date evaluation of LLMs' domain expertise.
Findings
Model performance correlates with expert benchmarks.
Benchmark enables analysis of knowledge acquisition.
Evaluation framework compares base and chat models.
Abstract
Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education. However, existing benchmarks are documented to be contaminated and are based on multiple-choice questions, which suffer from inherent biases. To measure domain-specific knowledge in LLMs, we present a deterministic pipeline that transforms raw domain corpora into completion-style benchmarks without relying on other LLMs or costly human annotation. Our approach first extracts domain-specific keywords and related target vocabulary from an input corpus. It then constructs prompt-target pairs where domain-specific words serve as prediction targets. By measuring LLMs' ability to complete these prompts, we provide a direct assessment of domain knowledge at low computational cost. Our pipeline avoids benchmark contamination, enables…
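The three steps the abstract names (keyword extraction, prompt-target construction, completion scoring) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the relative-frequency keyword score is a simple stand-in because this summary does not describe the paper's actual extraction method, and the function names, the `DOMAIN` and `REFERENCE` toy corpora, and the `complete` callable (a hypothetical wrapper around whatever LLM is being evaluated) are all invented here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_keywords(domain_text, reference_text, top_k=5, min_count=2):
    """Rank words by domain frequency relative to a reference corpus.

    Stand-in scoring (an assumption, not the paper's method): relative
    frequency in the domain corpus divided by add-one-smoothed relative
    frequency in the reference corpus.
    """
    dom = Counter(tokenize(domain_text))
    ref = Counter(tokenize(reference_text))
    dom_total = sum(dom.values())
    ref_total = sum(ref.values())
    scores = {}
    for word, count in dom.items():
        if count < min_count:
            continue
        ref_rate = (ref.get(word, 0) + 1) / (ref_total + 1)
        scores[word] = (count / dom_total) / ref_rate
    # Sort by descending score, then alphabetically, so output is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def build_pairs(domain_text, keywords, min_prefix_words=3):
    """Cut each sentence before its first usable keyword.

    The text before the keyword becomes the prompt and the keyword becomes
    the prediction target. At most one pair is produced per sentence.
    """
    keyword_set = set(keywords)
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", domain_text.strip()):
        words = sentence.split()
        for i, word in enumerate(words):
            normalized = re.sub(r"[^a-z]", "", word.lower())
            if normalized in keyword_set and i >= min_prefix_words:
                pairs.append((" ".join(words[:i]), normalized))
                break
    return pairs


def completion_accuracy(pairs, complete):
    """Fraction of prompts whose completion starts with the target word.

    `complete` is a hypothetical callable: prompt string in, text out.
    """
    if not pairs:
        return 0.0
    hits = sum(
        1
        for prompt, target in pairs
        if complete(prompt).strip().lower().startswith(target)
    )
    return hits / len(pairs)


# Toy corpora, invented for this example.
DOMAIN = (
    "The court granted the plaintiff an injunction. "
    "The defendant appealed the injunction to a higher court. "
    "A tort claim requires proof of negligence by the defendant."
)
REFERENCE = (
    "The weather was nice and the court of the palace was open. "
    "A dog ran to the park."
)

keywords = extract_keywords(DOMAIN, REFERENCE, top_k=2)
pairs = build_pairs(DOMAIN, keywords)
```

Because every step is a pure function of the input text, rerunning the sketch on the same corpus yields the same benchmark, which is the property the abstract refers to as deterministic. A real evaluation would replace `complete` with a call to the model under test and would likely score token probabilities rather than string prefixes; that choice is left open here.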
Taxonomy
Topics: Text Readability and Simplification · Topic Modeling · Natural Language Processing Techniques
