QueST: Incentivizing LLMs to Generate Difficult Problems

Hanxu Hu; Xingxing Zhang; Jannis Vamvas; Rico Sennrich; Furu Wei

arXiv:2510.17715·cs.CL·October 21, 2025

Open Access · 3 Reviews

TL;DR

QueST is a framework for generating challenging coding problems. It trains specialized problem generators using difficulty-aware graph sampling and difficulty-aware rejection fine-tuning, and the resulting synthetic problems are used to fine-tune LLMs for better reasoning and coding performance.

Contribution

The paper presents QueST, a new difficulty-aware problem generation method that enhances LLM training with synthetic challenging problems, improving downstream performance.

Findings

01

The trained generators outperform GPT-4o at creating challenging problems.

02

Fine-tuning on synthetic data improves model performance.

03

Synthetic problems enable smaller models to match larger ones.

Abstract

Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1) Clear, modular pipeline with explicit math and algorithms. The $\delta$ metric and RFT selection are precisely defined (Eqs. 5–9), with a practical filtering step for invalid executions. 2) The edge-weighting scheme combining co-occurrence and average difficulty is easy to implement and justified by seed annotations in TACO. 3) Training on QueST-generated data with Qwen3-8B as teacher matches or exceeds prior SFT datasets that used larger/stronger teachers, especially on harder USACO levels (
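The edge-weighting idea mentioned in point 2 can be illustrated with a small sketch. The paper's actual formula is not reproduced on this page, so everything below is an assumption for illustration: the function names `build_edge_weights` and `sample_concepts`, the mixing parameter `alpha`, the use of `log1p` on co-occurrence counts, and the convention that each seed problem carries a numeric difficulty in [0, 1] are all hypothetical.

```python
import itertools
import math
import random
from collections import defaultdict


def build_edge_weights(seeds, alpha=0.5):
    """Weight each concept pair by co-occurrence and average seed difficulty.

    seeds: iterable of (concepts, difficulty) pairs, difficulty assumed in [0, 1].
    alpha: hypothetical mixing factor in (0, 1]; not taken from the paper.
    """
    cooc = defaultdict(int)
    difficulty_sum = defaultdict(float)
    for concepts, difficulty in seeds:
        # Sort so that each undirected edge has one canonical key.
        for a, b in itertools.combinations(sorted(set(concepts)), 2):
            cooc[(a, b)] += 1
            difficulty_sum[(a, b)] += difficulty
    weights = {}
    for edge, count in cooc.items():
        avg_difficulty = difficulty_sum[edge] / count
        # Assumed combination: damped co-occurrence plus average difficulty.
        weights[edge] = alpha * math.log1p(count) + (1 - alpha) * avg_difficulty
    return weights


def sample_concepts(weights, start, steps, rng):
    """Weighted random walk over the concept graph without revisiting nodes."""
    neighbors = defaultdict(list)
    for (a, b), w in weights.items():
        neighbors[a].append((b, w))
        neighbors[b].append((a, w))
    path = [start]
    for _ in range(steps):
        options = [(n, w) for n, w in neighbors[path[-1]] if n not in path]
        if not options:
            break  # dead end: every neighbor was already visited
        nodes, node_weights = zip(*options)
        path.append(rng.choices(nodes, weights=node_weights, k=1)[0])
    return path


toy_seeds = [
    (["dp", "graphs"], 0.8),
    (["dp", "greedy"], 0.2),
    (["dp", "graphs"], 0.6),
]
toy_weights = build_edge_weights(toy_seeds)
toy_path = sample_concepts(toy_weights, "dp", steps=2, rng=random.Random(0))
```

In this toy graph the `dp`/`graphs` edge gets a larger weight than `dp`/`greedy` because it co-occurs more often and its seed problems are harder, so a walk starting at `dp` is more likely to move to `graphs`.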

Weaknesses

1) Difficulty proxy $\delta$ may conflate “hardness” with generator/judge idiosyncrasies; causal link not isolated. 2) The paper does not evaluate $\delta$ stability across different judge models. 3) The paper itself cites difficulty-aware/rejection methods and concept-graph generation (e.g., MathScale; DART-math; "weakness-driven" synthesis) in related work; QueST’s novelty is the combination plus code-specific engineering, not the first instance of difficulty-aware synthetic problem generati

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1) The paper proposes training a specialized teacher / problem generator model, rather than prompting a fixed model, to create synthetic code data. 2) The paper includes interesting ablations that anticipate and answer likely reader questions (e.g., Table 3).

Weaknesses

1) The primary novelty of the framework arises from the fine-tuned teacher. Yet, Table 5 shows that the trained teacher performs roughly the same as a fixed teacher (GPT-4o), so the added value (and claimed flexibility) is unclear. Without gains from training a teacher model, the rest of the pipeline/framework (concept extraction, synthetic data generation, and filtering) largely mirrors prior works. 2) The magnitude of improvement seems small / possibly noisy. In Table 4, RL improves average

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Difficulty Estimation via Self-Consistency: The authors estimate problem difficulty using self-consistency across multiple model outputs. 2. Difficulty-Guided Sampling: For each prompt, multiple candidate problems are generated, and only the most difficult one (based on the proposed difficulty metric) is retained for training. 3. Instead of letting the model generate simple problems repeatedly, the method continuously selects and trains on the most challenging problems, thus enhancing the gen
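The two ideas summarized in points 1 and 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's definition of its difficulty metric (Eqs. 5–9 are not reproduced on this page): here difficulty is assumed to be one minus the share of the most common rollout output, and the names `disagreement_difficulty`, `pick_hardest`, and `rollouts_for` are hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple


def disagreement_difficulty(outputs: Sequence[Hashable]) -> float:
    """Return 1 - (share of the most common output); 0.0 means full agreement.

    Assumed proxy: the less the sampled solutions agree, the harder the problem.
    """
    if not outputs:
        raise ValueError("need at least one rollout output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return 1.0 - most_common_count / len(outputs)


def pick_hardest(
    candidates: Sequence[str],
    rollouts_for: Callable[[str], Sequence[Hashable]],
) -> Tuple[float, str]:
    """Score every candidate problem and keep the one with the highest score."""
    scored = [(disagreement_difficulty(rollouts_for(c)), c) for c in candidates]
    # max() keeps the first candidate on ties.
    return max(scored, key=lambda pair: pair[0])


# Toy rollouts standing in for solver outputs on a shared test input.
toy_rollouts = {
    "sum two ints": ["7", "7", "7", "7"],
    "tricky dp": ["12", "15", "12", "9"],
}
best_score, best_problem = pick_hardest(list(toy_rollouts), toy_rollouts.__getitem__)
```

In a rejection fine-tuning loop of the kind the review describes, the retained candidate (here `best_problem`) would be paired with its prompt as a training example for the generator, while the easier candidates are discarded.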

Weaknesses

1. The paper does not introduce a fundamentally new data synthesis approach but rather extends MathScale with heuristic sampling improvements based on self-consistency. Such heuristics are intuitive but may not lead to substantial long-term impact. 2. The proposed method for measuring problem difficulty mainly relies on self-consistency within rollouts, which appears heuristic and lacks deeper theoretical justification to confirm its validity. 3. The synthetic data do not show clear superiority

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks