Estimating problem difficulty without ground truth using Large Language Model comparisons

Marthe Ballon; Andres Algaba; Brecht Verbeken; Vincent Ginis

arXiv:2512.14220·cs.LG·December 17, 2025

TL;DR

This paper introduces LLM compare, a novel, scalable method that estimates problem difficulty without ground truth: an LLM makes pairwise difficulty comparisons, and Bradley-Terry scores are computed from the outcomes. The method remains applicable to out-of-distribution problems.

Contribution

The paper presents LLM compare, a model-agnostic difficulty estimation method that does not depend on ground truth, aligns well with human judgments, and is robust to hallucinations.

Findings

01. High correlation with human annotations (Pearson r ≥ 0.80)

02. Robust to hallucinations with less than 6% degradation

03. Addresses out-of-distribution problem difficulty estimation

Abstract

Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation pipeline is a method for estimating problem difficulty. Current approaches, such as human calibration or performance-based scoring, fail to generalize to out-of-distribution problems, i.e., problems currently unsolvable by humans and LLMs, because they are time-consuming, not scalable, and dependent on ground truth. Therefore, we propose a new method for estimating problem difficulty, LLM compare, that addresses these limitations. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed based on the outcomes. To validate our method, we first propose a conceptual framework that positions existing approaches on three…
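The scoring step described in the abstract can be illustrated with a minimal sketch. Under the Bradley-Terry model, the probability that problem i is judged harder than problem j is p_i / (p_i + p_j), where p_i is a positive strength parameter. The code below fits these strengths with the standard minorization-maximization (MM) iteration and returns log-strengths as difficulty scores. The excerpt does not state how the authors fit the model, so the fitting routine, the function name `bradley_terry_scores`, the `prior` smoothing term, and the toy comparison list are all assumptions made for illustration. In practice the comparison list would come from LLM judgments; here it is hard-coded.

```python
import math
from collections import defaultdict


def bradley_terry_scores(comparisons, prior=0.5, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry log-strengths from (harder, easier) pairs.

    `prior` (an illustrative assumption, not from the paper) adds that many
    virtual wins to both sides of every compared pair, so that a problem
    that never "wins" still gets a finite score.
    """
    if prior <= 0:
        raise ValueError("prior must be positive to keep all scores finite")

    wins = defaultdict(float)   # wins[i]: times i was judged harder
    games = defaultdict(float)  # games[(i, j)]: comparisons between i and j
    for harder, easier in comparisons:
        if harder == easier:
            raise ValueError("a problem cannot be compared with itself")
        wins[harder] += 1.0
        games[tuple(sorted((harder, easier)))] += 1.0

    # Symmetric smoothing on every observed pair.
    for i, j in list(games):
        wins[i] += prior
        wins[j] += prior
        games[(i, j)] += 2.0 * prior

    items = sorted({x for pair in games for x in pair})
    strength = {i: 1.0 for i in items}

    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = defaultdict(float)
        for (i, j), n in games.items():
            d = n / (strength[i] + strength[j])
            denom[i] += d
            denom[j] += d
        new = {i: wins[i] / denom[i] for i in items}

        # Strengths are only defined up to a common factor, so fix the
        # geometric mean at 1 (log-scores then sum to zero).
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        new = {i: v / math.exp(log_mean) for i, v in new.items()}

        delta = max(abs(new[i] - strength[i]) for i in items)
        strength = new
        if delta < tol:
            break

    return {i: math.log(s) for i, s in strength.items()}


# Toy data: each tuple is (problem judged harder, problem judged easier).
toy_comparisons = [("C", "B"), ("C", "A"), ("B", "A"), ("C", "B"), ("B", "A")]
scores = bradley_terry_scores(toy_comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)  # hardest first
```

A higher score means the problem was judged harder more consistently. Because only comparison outcomes enter the fit, no reference solution is needed for any problem, which is the property the abstract emphasizes.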

Taxonomy

Topics: Text Readability and Simplification · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning