QueST: Incentivizing LLMs to Generate Difficult Problems
Hanxu Hu, Xingxing Zhang, Jannis Vamvas, Rico Sennrich, Furu Wei

TL;DR
QueST introduces a novel framework for generating challenging coding problems to improve large language models' reasoning and coding performance, surpassing existing datasets and models through synthetic data creation and fine-tuning.
Contribution
The paper presents QueST, a new difficulty-aware problem generation method that enhances LLM training with synthetic challenging problems, improving downstream performance.
Findings
QueST-trained generators create more challenging problems than GPT-4o.
Fine-tuning on synthetic data improves model performance.
Synthetic problems enable smaller models to match larger ones.
Abstract
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. However, their scalability is limited by human-labeled datasets and the lack of large-scale, challenging coding problem training data. Existing competitive coding datasets contain only thousands to tens of thousands of problems. Previous synthetic data generation methods rely on either augmenting existing instruction datasets or selecting challenging problems from human-labeled data. In this paper, we propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning that directly optimizes specialized generators to create challenging coding problems. Our trained generators demonstrate superior capability compared to even GPT-4o at creating challenging problems that benefit downstream performance. We…
Peer Reviews
Decision · Submitted to ICLR 2026
1) Clear, modular pipeline with explicit math and algorithms. The $\delta$ metric and RFT selection are precisely defined (Eqs. 5–9), with a practical filtering step for invalid executions. 2) The edge-weighting scheme combining co-occurrence and average difficulty is easy to implement and justified by seed annotations in TACO. 3) Training on QueST-generated data with Qwen3-8B as teacher matches or exceeds prior SFT datasets that used larger/stronger teachers, especially on harder USACO levels (
1) Difficulty proxy $\delta$ may conflate “hardness” with generator/judge idiosyncrasies; causal link not isolated. 2) The paper does not evaluate $\delta$ stability across different judge models. 3) The paper itself cites difficulty-aware/rejection methods and concept-graph generation (e.g., MathScale; DART-math; "weakness-driven" synthesis) in related work; QueST’s novelty is the combination plus code-specific engineering, not the first instance of difficulty-aware synthetic problem generati
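The edge-weighting idea mentioned in the review above (co-occurrence combined with average difficulty) can be sketched as follows. This is a minimal illustration, not the paper's formula: the convex combination, the mixing weight `alpha`, the normalization of difficulty to [0, 1], and the function names `build_edge_weights` and `sample_concepts` are all assumptions made for this example.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_edge_weights(seed_problems, alpha=0.5):
    """Weight each concept pair by co-occurrence and average difficulty.

    seed_problems: list of (concepts, difficulty) pairs, where difficulty
    is assumed to be pre-normalized to [0, 1] (hypothetical convention).
    alpha: hypothetical mixing weight between the two signals.
    """
    count = defaultdict(int)
    diff_sum = defaultdict(float)
    for concepts, difficulty in seed_problems:
        for u, v in combinations(sorted(set(concepts)), 2):
            count[(u, v)] += 1
            diff_sum[(u, v)] += difficulty
    if not count:
        return {}
    max_count = max(count.values())
    weights = {}
    for edge, c in count.items():
        avg_difficulty = diff_sum[edge] / c
        # Assumed form: convex mix of normalized frequency and difficulty.
        weights[edge] = alpha * (c / max_count) + (1 - alpha) * avg_difficulty
    return weights


def sample_concepts(weights, start, steps, rng, eps=1e-9):
    """Weighted random walk over the concept graph without revisiting nodes."""
    adjacency = defaultdict(list)
    for (u, v), w in weights.items():
        adjacency[u].append((v, w))
        adjacency[v].append((u, w))
    path = [start]
    for _ in range(steps):
        neighbors = [(n, w) for n, w in adjacency[path[-1]] if n not in path]
        if not neighbors:
            break
        nodes, node_weights = zip(*neighbors)
        # eps keeps the total weight positive even if every edge weight is 0.
        probs = [w + eps for w in node_weights]
        path.append(rng.choices(nodes, weights=probs, k=1)[0])
    return path
```

With three toy seed problems, `[("dp", "graph")` at difficulty 0.8 and 0.6, and `("dp", "greedy")` at 0.2, the pair `("dp", "graph")` gets weight 0.85 and `("dp", "greedy")` gets 0.35, so a walk starting at `"dp"` is biased toward the more frequent, harder pairing.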
1) The paper proposes training a specialized teacher / problem generator model, rather than prompting a fixed model, to create synthetic code data. 2) The paper includes interesting ablations that anticipate and answer likely reader questions (e.g., Table 3).
1) The primary novelty of the framework arises from the fine-tuned teacher. Yet, Table 5 shows that the trained teacher performs roughly the same as a fixed teacher (GPT-4o), so the added value (and claimed flexibility) is unclear. Without gains from training a teacher model, the rest of the pipeline/framework (concept extraction, synthetic data generation, and filtering) largely mirrors prior works. 2) The magnitude of improvement seems small / possibly noisy. In Table 4, RL improves average
1. Difficulty Estimation via Self-Consistency: The authors estimate problem difficulty using self-consistency across multiple model outputs. 2. Difficulty-Guided Sampling: For each prompt, multiple candidate problems are generated, and only the most difficult one (based on the proposed difficulty metric) is retained for training. 3. Instead of letting the model generate simple problems repeatedly, the method continuously selects and trains on the most challenging problems, thus enhancing the gen
1. The paper does not introduce a fundamentally new data synthesis approach but rather extends MathScale with heuristic sampling improvements based on self-consistency. Such heuristics are intuitive but may not lead to substantial long-term impact. 2. The proposed method for measuring problem difficulty mainly relies on self-consistency within rollouts, which appears heuristic and lacks deeper theoretical justification to confirm its validity. 3. The synthetic data do not show clear superiority
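The two steps the review above describes, estimating difficulty from self-consistency and keeping only the hardest candidate per prompt, can be sketched as follows. This is a stand-in, not the paper's $\delta$ (Eqs. 5–9 are not reproduced on this page): the score used here is one minus the majority-agreement rate of sampled solutions' outputs, averaged over test inputs. The names `difficulty` and `select_hardest`, and the use of `None` to mark an invalid execution, are assumptions for this example.

```python
from collections import Counter


def difficulty(outputs_per_test):
    """Estimate difficulty as disagreement among sampled solutions.

    outputs_per_test: one list per test input, holding the output of each
    sampled solution on that input; None marks a failed execution
    (hypothetical convention).
    Returns a score in [0, 1), or None if no test input had a valid output.
    """
    scores = []
    for outputs in outputs_per_test:
        valid = [o for o in outputs if o is not None]
        if not valid:
            continue  # filter test inputs where every execution was invalid
        majority_size = Counter(valid).most_common(1)[0][1]
        scores.append(1 - majority_size / len(valid))
    if not scores:
        return None
    return sum(scores) / len(scores)


def select_hardest(candidates):
    """Keep the most difficult candidate generated for one prompt.

    candidates: list of (problem_text, outputs_per_test) pairs.
    Returns (score, problem_text), or None if no candidate could be scored.
    """
    scored = []
    for problem_text, outputs_per_test in candidates:
        score = difficulty(outputs_per_test)
        if score is not None:
            scored.append((score, problem_text))
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])
```

Under this proxy, a problem on which all sampled solutions agree scores 0, while one on which they split scores higher. The selected (prompt, hardest problem) pairs would then serve as fine-tuning data for the generator, which is the rejection fine-tuning step named in the abstract.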
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
