How to Get Your LLM to Generate Challenging Problems for Evaluation

Arkil Patel; Siva Reddy; Dzmitry Bahdanau

arXiv:2502.14678 · cs.CL · February 21, 2025

TL;DR

This paper presents CHASE, a framework that automatically generates challenging evaluation problems for LLMs across multiple domains, reducing reliance on human annotation and improving assessment rigor.

Contribution

CHASE is a novel, domain-agnostic framework that synthetically creates high-quality, challenging problems for LLM evaluation without human involvement.

Findings

01. The generated benchmarks are challenging for current LLMs, which reach only 40-60% accuracy.
02. The framework ensures problem quality by decomposing generation into independently verifiable sub-tasks.
03. The benchmarks are publicly available for research use.
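
The second finding, building a hard problem bottom-up while verifying each sub-task on its own, can be illustrated with a toy sketch. This is not the CHASE implementation: the `Step` class, `verify_step`, `build_problem`, and the notebook example are all hypothetical, and a deterministic recomputation stands in for whatever LLM-based generator and verifier the framework actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One simple component of a larger problem (hypothetical structure)."""
    text: str                    # natural-language fragment added to the problem
    apply: Callable[[int], int]  # how this step changes the running answer


def verify_step(step: Step, before: int, claimed_after: int) -> bool:
    """Check a single step in isolation by recomputing its result.

    Assumption: a real pipeline would likely query a separate verifier
    model here; exact recomputation is used only to keep the sketch runnable.
    """
    return step.apply(before) == claimed_after


def build_problem(seed_text: str, seed_value: int,
                  steps: List[Step], claimed: List[int]) -> Tuple[str, int]:
    """Compose a harder problem from simpler steps, keeping only verified ones."""
    parts, value = [seed_text], seed_value
    for step, claim in zip(steps, claimed):
        if not verify_step(step, value, claim):
            continue  # drop any step whose claimed result cannot be confirmed
        parts.append(step.text)
        value = claim
    return " ".join(parts), value


steps = [
    Step("She doubles her collection.", lambda x: x * 2),
    Step("She gives away 3.", lambda x: x - 3),
    Step("She then buys 4 more.", lambda x: x + 4),
]
# The second claimed value is deliberately wrong (10 - 3 is 7, not 8),
# so that step fails verification and is left out of the final problem.
claimed = [10, 8, 14]

question, answer = build_problem("Ana has 5 notebooks.", 5, steps, claimed)
print(question)
print(answer)
```

Because every step is checked against only its own input and output, an error in one generated component is caught locally instead of silently corrupting the final answer.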

Abstract

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on…

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

-- The paper is written at an impressive quality, especially the figures and the elucidation of the problem's motivation and challenges.
-- The authors consider three sufficiently diverse tasks and benchmarks to showcase the utility of their approach.
-- The results are fairly compelling, and the benchmark indeed succeeds in yielding performance drops even from advanced models.

Weaknesses

-- Experimental results could have been deeper in the main text. It is for this reason that I am not inclined to give the paper a stellar rating.
-- The approach is simple and has some nice properties, but I am not too sure about its sensitivity and robustness. I felt inadequate attention was paid to this in the paper.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

- The problem addressed by this paper is critical to the evaluation of current LLMs: the lack of comprehensive and challenging datasets.
- The paper is well-structured, with comprehensive appendices, such as a detailed list of prompts used in CHASE.
- This paper presents a novel paradigm for data construction, which may have significant potential in the field of synthetic data.

Weaknesses

- Some issues with the details of the paper. For example, in the main figure (Figure 1), the bottom-right corner should say "12 pens" instead of "18 pens."
- The current dataset is relatively small, which may result in a high degree of randomness in evaluation results when using this dataset.
- The experiments are not sufficiently thorough. Some experimental designs lack strong motivation, and there is a lack of experiments that demonstrate the advantage of CHASE over other synthetic data genera

Reviewer 03 · Rating 3 · Confidence 4

Strengths

1. The experiments are comprehensive, with a good set of LLMs covering representative proprietary and open-source models.
2. The paper is well-written and clearly describes the methods, experiments, and results.

Weaknesses

1. Although overall I believe it is valuable to explore data synthesis for benchmark construction, I think the authors should be more careful in selecting appropriate settings. I think the most important motivation for this paper is that it is expensive and sometimes impracticable to create benchmarks with challenging problems. However, in some settings present in the paper, I feel that this may not be the case. For example, SWE-bench [1] also focuses on repo-level code generation, and they take

Code & Models

Repositories

mcgill-nlp/chase (Official)

Taxonomy

Topics: Artificial Intelligence in Law