IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation

Fan Lin; Shuyi Xie; Yong Dai; Wenlin Yao; Tianjiao Lang; Zishan Xu; Zhichao Hu; Xiao Xiao; Yuhong Liu; Yu Zhang

arXiv:2409.18892·cs.CL·October 8, 2024

TL;DR

This paper introduces IDGen, a prompt generation framework based on Item Discrimination theory, to create more challenging and discriminative evaluation datasets for assessing large language models' capabilities.

Contribution

The paper proposes a novel ID-induced prompt synthesis framework with a self-correct mechanism and predictive models, enhancing LLM evaluation by generating more discriminative prompts.

Findings

01 Generated prompts are more challenging and discriminative than previous datasets.

02 The framework effectively differentiates model performance across tasks.

03 A dataset of over 3,000 prompts will be released for evaluation research.

Abstract

As Large Language Models (LLMs) grow increasingly adept at managing complex tasks, the evaluation set must keep pace with these advancements to ensure it remains sufficiently discriminative. Item Discrimination (ID) theory, which is widely used in educational assessment, measures the ability of individual test items to differentiate between high and low performers. Inspired by this theory, we propose an ID-induced prompt synthesis framework for evaluating LLMs to ensure the evaluation set can continually update and refine according to model abilities. Our data synthesis framework prioritizes both breadth and specificity. It can generate prompts that comprehensively evaluate the capabilities of LLMs while revealing meaningful performance differences between models, allowing for effective discrimination of their relative strengths and weaknesses across various tasks and domains. To…
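To make the notion of item discrimination concrete, here is a minimal sketch of the classical upper-lower discrimination index from educational testing, with each LLM treated as a "test taker" and each prompt as an "item". This is the textbook index, not necessarily the formula the paper uses. The function name `discrimination_index`, its parameters, and the sample scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def discrimination_index(
    item_scores: Sequence[float],
    total_scores: Sequence[float],
    fraction: float = 0.27,
) -> float:
    """Classical upper-lower discrimination index for a single item.

    item_scores[i]  -- score of test taker i on this item, scaled to [0, 1]
    total_scores[i] -- overall score of test taker i on the whole test
    fraction        -- share of test takers placed in each extreme group
                       (0.27 is the conventional choice)

    NOTE: illustrative assumption only; the paper may define its
    discrimination measure differently.
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must have equal length")
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5] so the groups do not overlap")

    n = len(total_scores)
    if n < 2:
        raise ValueError("need at least two test takers")
    k = max(1, round(n * fraction))

    # Rank test takers by overall score, lowest first.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]

    # Mean item score within each extreme group.
    p_upper = sum(item_scores[i] for i in upper) / k
    p_lower = sum(item_scores[i] for i in lower) / k
    return p_upper - p_lower


if __name__ == "__main__":
    # Six hypothetical models, ordered from strongest to weakest overall.
    totals = [90, 80, 70, 40, 30, 20]
    hard_prompt = [1, 1, 1, 0, 0, 0]  # only strong models answer correctly
    easy_prompt = [1, 1, 1, 1, 1, 1]  # every model answers correctly
    print(discrimination_index(hard_prompt, totals, fraction=0.5))
    print(discrimination_index(easy_prompt, totals, fraction=0.5))
```

An index near 1 means the prompt separates strong models from weak ones, while an index near 0 means all models score alike on it, so the prompt contributes little to ranking them.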

Peer Reviews

Decision·NeurIPS 2024 poster

Reviewer 01 · Rating 4 · Confidence 2

Strengths

The paper proposes a novel prompt generation method to produce more challenging evaluation data. The paper is well-structured and clearly written. The methodology and evaluation criteria are explained clearly, making the paper accessible to a broad audience.

Weaknesses

The paper used only one LLM (Hunyuan) to generate data and did not verify whether the proposed method generalizes to other LLMs. It is debatable whether using test data generated by an LLM to evaluate the performance of LLMs has practical value. The paper lacks validation of the effectiveness of the machine-generated test set, such as a comparison of its metrics with those of human-annotated datasets. The paper also lacks an analysis of the diversity of the data used to produce the test set.

Code & Models

Repositories

DUTlf/IDGen · Official

Videos

IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training