InfoSynth: Information-Guided Benchmark Synthesis for LLMs
Ishir Garg, Neel Kolhe, Xuandong Zhao, Dawn Song

TL;DR
InfoSynth introduces an information-theoretic framework and automated pipeline for generating diverse, novel, and high-quality reasoning benchmarks for LLM evaluation, reducing manual effort and avoiding data contamination.
Contribution
It presents a novel, automated method for synthesizing reasoning benchmarks guided by information theory, enabling scalable and controllable creation of diverse evaluation datasets.
Findings
Achieves 97% accuracy in generating test cases and solutions.
Synthesized benchmarks show higher novelty and diversity than seed datasets.
Provides a scalable pipeline for high-quality benchmark creation.
Abstract
Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from…
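To make the idea of embedding-based novelty and diversity metrics concrete, here is a minimal sketch. The abstract excerpt does not say which estimators InfoSynth uses, so this assumes standard k-nearest-neighbor estimators (a Wang-Kulkarni-Verdú style KL-divergence estimate and a Kozachenko-Leonenko style differential entropy estimate) applied to toy 2-D points standing in for problem embeddings. The function names, the choice `k=1`, and the toy data are all illustrative assumptions, not the paper's implementation.

```python
import math
import random


def kth_nn_dist(point, others, k):
    """Euclidean distance from `point` to its k-th nearest neighbour in `others`."""
    dists = sorted(math.dist(point, o) for o in others)
    return max(dists[k - 1], 1e-12)  # guard against log(0) on duplicate points


def knn_kl_divergence(xs, ys, k=1):
    """k-NN estimate of KL(P || Q) from samples xs ~ P and ys ~ Q (assumed estimator)."""
    n, m, d = len(xs), len(ys), len(xs[0])
    total = 0.0
    for i, x in enumerate(xs):
        rho = kth_nn_dist(x, xs[:i] + xs[i + 1:], k)  # neighbour within xs, excluding x
        nu = kth_nn_dist(x, ys, k)                    # neighbour in the reference set
        total += math.log(nu / rho)
    return d * total / n + math.log(m / (n - 1))


def digamma_int(n):
    """Digamma function at a positive integer n."""
    return -0.5772156649015329 + sum(1.0 / j for j in range(1, n))


def knn_entropy(xs, k=1):
    """Kozachenko-Leonenko style estimate of differential entropy in nats."""
    n, d = len(xs), len(xs[0])
    # Log-volume of the d-dimensional unit ball.
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    log_r = sum(
        math.log(kth_nn_dist(x, xs[:i] + xs[i + 1:], k)) for i, x in enumerate(xs)
    )
    return digamma_int(n) - digamma_int(k) + log_unit_ball + d * log_r / n


# Toy "embeddings": a seed set, a similar set, and a shifted (more novel) set.
rng = random.Random(0)
seed_set = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
near = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
far = [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(200)]

kl_near = knn_kl_divergence(near, seed_set)  # novelty of a similar set
kl_far = knn_kl_divergence(far, seed_set)    # novelty of a shifted set
h_seed = knn_entropy(seed_set)               # diversity of the seed set
```

Under these assumptions, a set drawn from a shifted distribution scores a larger KL estimate against the seed set than a set drawn from the same distribution, and spreading the points out raises the entropy estimate. Real sentence embeddings are much higher-dimensional, where k-NN estimators are known to need more samples to stabilize.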
Peer Reviews
Decision · Submitted to ICLR 2026
1. KL-divergence as a novelty measure and entropy as a diversity measure make sense, and both are computationally efficient compared to model-based evaluations.
2. This work includes several important ablation studies that validate key design choices:
   - multiple mutation difficulties provide difficulty control
   - iterative feedback substantially improves correctness
   - postprocessing improves clarity
   - k-farthest neighbor filtering increases diversity
1. The paper sets the goal of creating contamination-free benchmarks, but it lacks experiments on, or discussion of, contamination for the proposed data generation method.
2. No comparison with other synthetic benchmark generation methods. Section 2 lists multiple comparable methods, but none of them is benchmarked in the empirical study. Ideally, the authors should compare, across different dataset generation methods, novelty / diversity metrics, correctness rates,
1. The experiments cover a wide range of LLMs, including open-weight models like Qwen and SOTA models like GPT and Claude.
2. The implementation is described in detail. The prompt is also provided, so I think the reproducibility is good.
3. This paper is generally well-written and easy to follow.
1. **Whether KL can reflect the novelty of a dataset is questionable.** First, this paper calculates the KL Divergence of the problem statements in the dataset based on the embeddings output by a very small model. I believe this can at best only measure the differences between problem statements at the literal level, rather than truly capturing the novelty of these problems in their essence. For instance, if I add a large section of irrelevant background stories to each problem in a dataset, the
1. The paper repurposes KL-Divergence and differential entropy to measure benchmark novelty and diversity. This is a creative contribution to the benchmark synthesis domain.
2. Section 3 effectively introduces the desirable properties (novelty and diversity) and provides both theoretical foundation and empirical validation with 95% confidence intervals.
3. The combination of genetic algorithms, mutation/crossover operations, and iterative code feedback is well-designed.
1. Limited Metric Validation Scope. The KL-Divergence and entropy validations primarily compare Leetcode subsets or extract subsets from larger benchmarks. While these experiments demonstrate the metrics work in clear-cut cases, validation on independent benchmarks with subtle differences is missing. For example, comparing HumanEval vs MBPP, or benchmarks differing in problem-solving approaches (greedy vs dynamic programming) would better demonstrate discriminative power.
2. Estimator Stabili
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Machine Learning and Data Classification
