Benchmarking for Domain-Specific LLMs: A Case Study on Academia and Beyond
Rubing Chen, Jiaxin Wu, Jian Wang, Xulu Zhang, Wenqi Fan, Chenghua Lin, Xiao-Yong Wei, Qing Li

TL;DR
This paper introduces Comp-Comp, a benchmarking framework that prioritizes comprehensiveness and compactness over data scaling, and uses it to build high-quality domain-specific benchmarks such as PolyBench for academia, with the aim of improving evaluation precision and recall.
Contribution
The paper proposes a novel iterative benchmarking framework based on comprehensiveness and compactness, challenging the reliance on data scaling for domain-specific LLM evaluation.
Findings
PolyBench is a large-scale academic benchmark created using the framework.
Comp-Comp improves the precision and recall of domain-specific LLM evaluation.
The framework is adaptable to various specialized fields.
Abstract
The increasing demand for domain-specific evaluation of large language models (LLMs) has led to the development of numerous benchmarks. These efforts often adhere to the principle of data scaling, relying on large corpora or extensive question-answer (QA) sets to ensure broad coverage. However, the impact of corpus and QA set design on the precision and recall of domain-specific LLM performance remains poorly understood. In this paper, we argue that data scaling is not always the optimal principle for domain-specific benchmark construction. Instead, we introduce Comp-Comp, an iterative benchmarking framework grounded in the principle of comprehensiveness and compactness. Comprehensiveness ensures semantic recall by covering the full breadth of the domain, while compactness improves precision by reducing redundancy and noise. To demonstrate the effectiveness of our approach, we present a…
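To make the two principles concrete, here is a minimal sketch of how comprehensiveness (a recall-style measure) and compactness (a precision-style measure) might be scored for a toy benchmark. It is not the paper's method: the topic labels, the function names, and the exact-match notion of redundancy are all assumptions made for illustration, and a semantic setting would presumably compare embeddings rather than strings.

```python
# Illustrative proxies only; names and definitions are hypothetical,
# not taken from the Comp-Comp paper.

def comprehensiveness(domain_topics, item_topics):
    """Share of domain topics hit by at least one benchmark item (recall proxy)."""
    domain = set(domain_topics)
    if not domain:
        return 0.0
    return len(domain & set(item_topics)) / len(domain)


def compactness(domain_topics, item_topics):
    """Share of benchmark items that are in-domain and not duplicates (precision proxy)."""
    items = list(item_topics)
    if not items:
        return 0.0
    domain = set(domain_topics)
    seen = set()
    useful = 0
    for topic in items:
        # An item counts only if it is in-domain and its topic is not yet covered.
        if topic in domain and topic not in seen:
            useful += 1
            seen.add(topic)
    return useful / len(items)


# Toy data: four domain topics, four benchmark items tagged by topic.
domain = ["admissions", "library", "courses", "research"]
items = ["admissions", "admissions", "courses", "sports"]

coverage = comprehensiveness(domain, items)  # 2 of 4 domain topics covered
tightness = compactness(domain, items)       # 2 of 4 items are useful
```

In this toy case, adding more "admissions" items would scale the data without raising coverage and would lower compactness, which is the kind of trade-off the abstract argues data scaling alone does not address.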
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
