SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

Yiyang Gu; Junwei Yang; Junyu Luo; Ye Yuan; Bin Feng; Yingce Xia; Shufang Xie; Kaili Liu; Bohan Wu; Qi Shi; Haoran Li; Beier Xiao; Zhiping Xiao; Xiao Luo; Weizhi Zhang; Philip S. Yu; Zequn Liu; Ming Zhang

arXiv:2605.19357 · cs.CL · May 20, 2026

TL;DR

SciCustom is a scalable framework that constructs customizable, knowledge-based benchmarks from scientific data to evaluate large language models' specific scientific capabilities.

Contribution

It introduces a novel ontology-grounded knowledge organization and retrieval method for creating fine-grained, application-specific scientific benchmarks without expert annotation.

Findings

01. Reveals fine-grained differences in LLM capabilities in chemistry and healthcare.

02. Does not require expert annotation or synthetic question generation.

03. Enables relevance-aware benchmark retrieval and efficient evaluation.

Abstract

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose SciCustom, a framework that addresses these limitations by enabling the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary…
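The voting-based multi-model consensus step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so the example below assumes a simple vote-count threshold: each model nominates the knowledge units it judges relevant to a requirement, and a unit is kept when at least `min_votes` models nominate it. The function name `consensus_units`, the model names, and the knowledge-unit labels are all hypothetical and are not taken from the SciCustom repository.

```python
from collections import Counter
from typing import Dict, Iterable, List


def consensus_units(votes_per_model: Dict[str, Iterable[str]], min_votes: int) -> List[str]:
    """Return the knowledge units nominated by at least `min_votes` models.

    Assumption: a plain vote-count threshold; the paper's actual rule may differ.
    """
    counts: Counter = Counter()
    for units in votes_per_model.values():
        # Deduplicate so each model contributes at most one vote per unit.
        counts.update(set(units))
    return sorted(unit for unit, n in counts.items() if n >= min_votes)


# Hypothetical nominations from three models for one custom requirement.
votes = {
    "model_a": ["organic_synthesis", "reaction_kinetics", "toxicology"],
    "model_b": ["organic_synthesis", "toxicology"],
    "model_c": ["organic_synthesis", "spectroscopy"],
}

selected = consensus_units(votes, min_votes=2)
print(selected)  # ['organic_synthesis', 'toxicology']
```

Raising `min_votes` trades recall for precision: with `min_votes=3` only units that every model nominated survive, which in this toy input is `organic_synthesis` alone.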

Code & Models

Repositories

yjwtheonly/SciCustom (GitHub)