Automatic Curriculum Expert Iteration for Reliable LLM Reasoning
Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, Doyen Sahoo

TL;DR
This paper introduces Auto-CEI, a method that improves large language model reasoning by automatically balancing assertiveness and conservativeness, reducing hallucinations and laziness through curriculum-based reward adjustments and expert iteration.
Contribution
Auto-CEI is a novel approach that combines expert iteration with automatic curriculum adjustment to enhance LLM reasoning and alignment, addressing hallucinations and over-cautious responses.
Findings
Auto-CEI outperforms SOTA baselines on logical reasoning, mathematics, and planning tasks.
It effectively balances assertiveness and conservativeness in LLM responses.
Auto-CEI reduces hallucinations and laziness, improving robustness and reliability.
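One way to make the balance between assertiveness and conservativeness concrete is to score a batch of responses by outcome. The sketch below is illustrative only: the function name `summarize_outcomes`, the three outcome labels, and the metric definitions are assumptions made for this example, not the paper's evaluation code.

```python
from collections import Counter


def summarize_outcomes(outcomes):
    """Summarize per-question outcomes for a reasoning model.

    Each outcome is one of "correct", "wrong", or "refused"
    (hypothetical labels chosen for this sketch).
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("outcomes must be non-empty")
    counts = Counter(outcomes)
    answered = counts["correct"] + counts["wrong"]
    return {
        # Share of all questions answered correctly.
        "accuracy": counts["correct"] / total,
        # Share of attempted answers that are correct (low => hallucination).
        "precision": counts["correct"] / answered if answered else 0.0,
        # Share of questions the model declined (high => laziness).
        "refusal_rate": counts["refused"] / total,
    }


if __name__ == "__main__":
    print(summarize_outcomes(["correct", "wrong", "refused", "correct"]))
```

Under these definitions, a model that refuses everything has a refusal rate of 1 and an accuracy of 0, while a model that never refuses can only raise precision by actually answering correctly.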
Abstract
Hallucinations (i.e., generating plausible but inaccurate content) and laziness (i.e., excessive refusals or defaulting to "I don't know") persist as major challenges in LLM reasoning. Current efforts to reduce hallucinations primarily focus on factual errors in knowledge-grounded tasks, often neglecting hallucinations related to faulty reasoning. Meanwhile, some approaches render LLMs overly conservative, limiting their problem-solving capabilities. To mitigate hallucination and laziness in reasoning tasks, we propose Automatic Curriculum Expert Iteration (Auto-CEI) to enhance LLM reasoning and align responses to the model's capabilities: assertively answering within its limits and declining when tasks exceed them. In our method, Expert Iteration explores the reasoning trajectories near the LLM policy, guiding incorrect paths back on track to reduce compounding errors and improve…
Peer Reviews
Decision: ICLR 2025 Poster
- The paper is clearly motivated; the problem of hallucinations in LLM reasoning has been comparatively underexplored.
- The paper presents the problem, the methods, and the baselines clearly.
- The results on the three datasets (MATH, Blocksworld, and BoardgameQA) are comprehensive and provide sufficient evidence of the method's utility.
- The authors also present sufficient ablations of their method, with varying versions of the curriculum.
- Missing citations for central EI works: STaR (https://arxiv.org/abs/2203.14465), Training Chain-of-Thought via Latent-Variable Inference (https://arxiv.org/abs/2403.04642)
- What other measures of difficulty are possible (a very recent paper that the authors can choose to ignore, since it came out after the paper deadline: https://arxiv.org/abs/2410.04707)? Can a linear probe decode difficulty? A discussion is needed.
- Similarly, are there other mechanisms of knowing when to say I don't know poss
1. Using expert iteration is a good idea.
2. The authors consider the trade-off between avoiding hallucinations and avoiding laziness, which is important.
3. This paper aims to avoid reasoning hallucinations and to teach the model to say IDK on reasoning tasks that are beyond its ability.
1. Although it considers the trade-off between helpfulness and laziness, the paper controls this trade-off through hyperparameters instead of proposing principles or methods for choosing the optimal hyperparameters.
2. The evaluation metrics in the paper are incomplete; for example, IDK only measures the proportion of times the model outputs "IDK" without assessing the accuracy of those IDK responses.
3. The experiments in the paper may involve unfair comparisons, as AUTO-CEI conducts multiple sear
1. The paper presents a reasonable motivation that previous approaches to addressing hallucination problems often suffer from overcorrection, leading to overly conservative responses from LLMs on many questions.
2. The paper introduces an innovative reward mechanism that dynamically balances the choice between responding or refusing to answer based on question difficulty. Subsequent experiments demonstrate that this method effectively handles the trade-off between answering and refusing. Figure
1. The paper's reward uses the length of the generated text to measure the level of exploration. This limits the method's generalizability and scalability, as it is challenging to find a universal text-processing metric for different types of text (e.g., code, tables). Furthermore, reward parameters such as c1 and c2 also vary across datasets, which raises concerns about the method's effectiveness when training and testing distributions are not similar.
2. The experimenta
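The length-based refusal reward that the last review questions can be pictured with a toy function. This is only a sketch under stated assumptions: the review names the parameters c1 and c2 but the text here does not give the paper's formula, so the tanh shape, the function name `refusal_reward`, and the roles assigned to `c1` (slope) and `c2` (length threshold) are all hypothetical.

```python
import math


def refusal_reward(num_steps: int, c1: float, c2: float) -> float:
    """Toy reward in (-1, 1) for an "I don't know" response.

    ASSUMPTION: this functional form is illustrative, not the paper's.
    num_steps: length of the reasoning attempt before refusing.
    c1: slope; how sharply the reward changes around the threshold.
    c2: threshold; refusing earlier than this is penalized.
    """
    return math.tanh(c1 * (num_steps - c2))


if __name__ == "__main__":
    # Raising the threshold c2 (as a curriculum might) demands a longer
    # reasoning attempt before a refusal stops being penalized.
    for c2 in (3.0, 6.0, 9.0):
        rewards = [round(refusal_reward(n, 0.5, c2), 2) for n in (2, 6, 10)]
        print(f"c2={c2}: {rewards}")
```

With a shape like this, a refusal after a short attempt is penalized and a refusal after a long attempt is rewarded. It also illustrates the reviewer's concern: a step count or text length is easy to define for chain-of-thought prose but less obviously meaningful for code or tables, and suitable values of `c1` and `c2` would likely differ by dataset.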
Taxonomy
Topics: Educational Technology and Assessment · Software Engineering Techniques and Practices
Methods: ALIGN · Focus
