How to Get Your LLM to Generate Challenging Problems for Evaluation
Arkil Patel, Siva Reddy, Dzmitry Bahdanau

TL;DR
This paper presents CHASE, a framework that automatically generates challenging evaluation problems for LLMs across multiple domains, reducing reliance on human annotation and improving assessment rigor.
Contribution
CHASE is a novel, domain-agnostic framework that synthetically creates high-quality, challenging problems for LLM evaluation without human involvement.
Findings
The generated benchmarks are challenging for current LLMs, which reach only 40-60% accuracy on them.
The framework supports problem quality and correctness by decomposing generation into independently verifiable sub-tasks.
Benchmarks are publicly available for research use.
Abstract
The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems. In this work, we introduce CHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: (1) document-based question answering, (2) repository-level code completion, and (3) math reasoning. The performance of state-of-the-art LLMs on…
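The bottom-up construction and independently verifiable sub-tasks described in the abstract can be illustrated with a toy sketch for the math-reasoning setting. This is not the authors' implementation: in CHASE an LLM generates and checks the components, whereas the `Step`, `build_problem`, and `verify_trace` names below, the hard-coded arithmetic, and the pen example are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One simple component of a larger problem (hypothetical structure)."""
    text: str                    # sentence appended to the problem statement
    apply: Callable[[int], int]  # how this step changes the running answer


def build_problem(seed_text: str, seed_value: int,
                  steps: List[Step]) -> Tuple[str, List[int]]:
    """Compose a harder problem bottom-up, recording the answer after each step."""
    sentences = [seed_text]
    trace = [seed_value]
    for step in steps:
        sentences.append(step.text)
        trace.append(step.apply(trace[-1]))
    return " ".join(sentences), trace


def verify_trace(seed_value: int, steps: List[Step], trace: List[int]) -> bool:
    """Check each step in isolation: step i must map trace[i] to trace[i + 1]."""
    if len(trace) != len(steps) + 1 or trace[0] != seed_value:
        return False
    return all(step.apply(trace[i]) == trace[i + 1]
               for i, step in enumerate(steps))


# Assumed example: in a real pipeline these steps would come from a generator model.
steps = [
    Step("She buys 2 more boxes of 3 pens each.", lambda x: x + 2 * 3),
    Step("She then gives half of her pens to a friend.", lambda x: x // 2),
]
problem, trace = build_problem("Ana has 6 pens.", 6, steps)
question = problem + " How many pens does Ana have now?"
final_answer = trace[-1]
```

Because every intermediate value is recorded, a faulty step can be located without re-solving the whole problem, which is the property the abstract attributes to decomposing generation into verifiable sub-tasks.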
Peer Reviews
Decision: Submitted to ICLR 2025
-- The paper is written at an impressive quality, especially the figures and the elucidation of the problem's motivation and challenges.
-- The authors consider three sufficiently diverse tasks and benchmarks to showcase the utility of their approach.
-- The results are fairly compelling, and the benchmark indeed succeeds in yielding performance drops even from advanced models.

-- Experimental results could have been deeper in the main text. It is for this reason that I am not inclined to give the paper a stellar rating.
-- The approach is simple and has some nice properties, but I am not too sure about its sensitivity and robustness. I felt inadequate attention was paid to this in the paper.

- The problem addressed by this paper is critical to the evaluation of current LLMs: the lack of comprehensive and challenging datasets.
- The paper is well-structured, with comprehensive appendices, such as a detailed list of prompts used in CHASE.
- This paper presents a novel paradigm for data construction, which may have significant potential in the field of synthetic data.

- Some issues with the details of the paper. For example, in the main figure (Figure 1), the bottom-right corner should say "12 pens" instead of "18 pens."
- The current dataset is relatively small, which may result in a high degree of randomness in evaluation results when using this dataset.
- The experiments are not sufficiently thorough. Some experimental designs lack strong motivation, and there is a lack of experiments that demonstrate the advantage of CHASE over other synthetic data genera

1. The experiments are comprehensive, with a good set of LLMs covering representative proprietary and open-source models.
2. The paper is well-written and clearly describes the methods, experiments, and results.

1. Although overall I believe it is valuable to explore data synthesis for benchmark construction, I think the authors should be more careful in selecting appropriate settings. I think the most important motivation for this paper is that it is expensive and sometimes impracticable to create benchmarks with challenging problems. However, in some settings presented in the paper, I feel that this may not be the case. For example, SWE-bench [1] also focuses on repo-level code generation, and they take
Taxonomy
Topics: Artificial Intelligence in Law
