IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation
Fan Lin, Shuyi Xie, Yong Dai, Wenlin Yao, Tianjiao Lang, Zishan Xu, Zhichao Hu, Xiao Xiao, Yuhong Liu, Yu Zhang

TL;DR
This paper introduces IDGen, a prompt generation framework based on Item Discrimination theory, to create more challenging and discriminative evaluation datasets for assessing large language models' capabilities.
Contribution
The paper proposes a novel ID-induced prompt synthesis framework with a self-correct mechanism and predictive models, enhancing LLM evaluation by generating more discriminative prompts.
Findings
Generated prompts are more challenging and discriminative than previous datasets.
The framework effectively differentiates model performance across tasks.
A dataset of over 3,000 prompts will be released for evaluation research.
Abstract
As Large Language Models (LLMs) grow increasingly adept at managing complex tasks, the evaluation set must keep pace with these advancements to ensure it remains sufficiently discriminative. Item Discrimination (ID) theory, which is widely used in educational assessment, measures the ability of individual test items to differentiate between high and low performers. Inspired by this theory, we propose an ID-induced prompt synthesis framework for evaluating LLMs to ensure the evaluation set can continually update and refine according to model abilities. Our data synthesis framework prioritizes both breadth and specificity. It can generate prompts that comprehensively evaluate the capabilities of LLMs while revealing meaningful performance differences between models, allowing for effective discrimination of their relative strengths and weaknesses across various tasks and domains. To…
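To make the Item Discrimination idea from the abstract concrete, here is a minimal sketch of the classical upper-lower discrimination index used in educational testing. It is an illustration only, not the metric defined in the paper. The function name `discrimination_index` and the 27% group split are assumptions chosen for this example. In the LLM setting, one could loosely read "examinees" as models and "items" as prompts.

```python
from typing import Sequence


def discrimination_index(
    item_scores: Sequence[float],
    total_scores: Sequence[float],
    fraction: float = 0.27,  # assumed split; a common convention, not from the paper
) -> float:
    """Classical upper-lower discrimination index for a single item.

    item_scores[i]  -- score in [0, 1] that examinee i obtained on this item
    total_scores[i] -- examinee i's overall score, used to rank examinees
    Returns the mean item score of the top group minus that of the bottom group.
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must have equal length")
    n = len(total_scores)
    if n < 2:
        raise ValueError("need at least two examinees")

    # Size of the upper and lower groups (at least one examinee each).
    k = max(1, round(n * fraction))

    # Rank examinees from lowest to highest overall score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]

    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower


# Hypothetical data: ten examinees ranked by overall score 1..10.
totals = list(range(1, 11))
sharp_item = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # only strong examinees succeed
flat_item = [1] * 10                          # everyone succeeds
print(discrimination_index(sharp_item, totals))
print(discrimination_index(flat_item, totals))
```

A value near 1 means the item separates high and low performers well, a value near 0 means it does not separate them, and a negative value means low performers did better on the item than high performers.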
Peer Reviews
Decision: NeurIPS 2024 poster
The paper proposes a novel prompt generation method to produce more challenging evaluation data. The paper is well-structured and clearly written. The methodology and evaluation criteria are explained clearly, making the paper accessible to a broad audience.
The paper used only one LLM (Hunyuan) to generate data and did not verify whether the proposed method generalizes to other LLMs. It is debatable whether using LLM-generated test data to evaluate LLM performance has practical value. The paper lacks validation of the effectiveness of the machine-generated test set, such as a comparison of its metrics with those of human-annotated datasets. It also lacks an analysis of the diversity of the data used to produce the test set.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training
