Learning to Rank Chain-of-Thought: Using a Small Model
Eric Hanchen Jiang, Haozheng Luo, Shengyuan Pang, Xiaomin Li, Zhenting Qi, Hengli Li, Cheng-Fu Yang, Zongyu Lin, Xinfeng Li, Hao Xu, Kai-Wei Chang, Ying Nian Wu

TL;DR
This paper presents EORM, a lightweight energy-based verifier for Chain-of-Thought reasoning in LLMs, achieving high accuracy with minimal parameters and outperforming expensive verification methods.
Contribution
Introduces EORM, a small, efficient, post-hoc verifier that effectively ranks reasoning solutions using simple outcome labels, reducing computational costs.
Findings
EORM achieves 90.7% accuracy on GSM8k with only 55M parameters.
EORM outperforms traditional reward models and matches or exceeds more resource-intensive methods.
EORM generalizes well to out-of-distribution problems and unseen models.
Abstract
Large Language Models (LLMs) struggle with reliable mathematical reasoning, and current verification methods are often computationally expensive. This paper introduces the Energy Outcome Reward Model (EORM), a highly efficient, lightweight post-hoc verifier designed to address this challenge. EORM uses an energy-based framework to rank Chain-of-Thought (CoT) solutions, learning to distinguish correct from incorrect reasoning using only simple outcome labels, thus eliminating the need for expensive annotations. With only 55M parameters, over 127 times smaller than typical reward models, EORM boosts the accuracy of Llama 3 8B to 90.7% on GSM8k and 63.7% on MATH. This performance is achieved by efficiently selecting the optimal reasoning path from a pool of candidates, allowing it to match or exceed the accuracy of far more resource-intensive Best-of-N sampling techniques. Crucially, our…
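The selection step described in the abstract, picking one reasoning path from a pool of candidates by energy, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the candidate strings and energy values are made up, and the function names `boltzmann_probs` and `select_lowest_energy` are hypothetical. It assumes the usual energy-based convention that lower energy means a more plausible solution, with p(y) = exp(-E(y)) / Z over the finite candidate pool.

```python
import math


def boltzmann_probs(energies):
    """Map energies to probabilities p(y) = exp(-E(y)) / Z over a finite pool."""
    # Shift by the minimum energy for numerical stability; the shift cancels in the ratio.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    z = sum(weights)  # normalization constant restricted to this candidate pool
    return [w / z for w in weights]


def select_lowest_energy(candidates, energies):
    """Return the candidate with the lowest energy (equivalently, the highest probability)."""
    best_index = min(range(len(candidates)), key=lambda i: energies[i])
    return candidates[best_index]


# Hypothetical pool: three sampled solutions with made-up verifier energies.
candidates = ["solution A", "solution B", "solution C"]
energies = [2.1, -0.4, 1.3]

best = select_lowest_energy(candidates, energies)
probs = boltzmann_probs(energies)
print(best, [round(p, 3) for p in probs])
```

Because ranking only needs the ordering of energies, the normalization constant never has to be computed at selection time; the probabilities are shown only to make the connection to the energy-based formulation explicit.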
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
* **Exceptional Efficiency:** The most significant strength is the model's size. At only **55M parameters**, EORM is over 127 times smaller than standard 7B-8B parameter reward models. This makes it an incredibly practical and lightweight tool that can be cheaply deployed for inference alongside a generator LLM.
* **Low-Cost Supervision:** The model is trained *only* on binary outcome labels (correct/incorrect). This is a massive practical advantage over Process Reward Models (PRMs), which requ
* **Confusing "ORM" Baseline:** In Figure 3, the paper compares EORM against "ORM". This is confusing because the paper's own model is an "Energy **Outcome Reward Model**" (EORM). The paper also mentions "traditional Outcome Reward Models", but the specific architecture and loss function (e.g., standard classification cross-entropy?) of this "ORM" baseline are not defined. This makes the comparison in Figure 3 difficult to interpret.
* **Massive Training Data Requirement:** The model's success
[S1] Methodological Modification. The paper provides a clear and focused description of its methodological modification of the energy-based modeling framework for ranking Chain-of-Thought outputs. The adaptation of EBM to reasoning verification is reasonable and technically sound.
[S2] Efficiency and Scalability. The proposed model is lightweight, using only 55M parameters compared to multi-billion-parameter reward models, which demonstrates strong potential for efficient and scalable deploymen
[W1] Unclear and Potentially Unfair Comparison Setup (Table 2). The main quantitative comparison in Table 2 does not clearly specify how the baselines were selected or evaluated. The listed models, including WizardMath, DART-Math, and MetaMath, incorporate different forms of instruction tuning or reinforcement learning (for example, RLHF, RLEIF, or preference optimization), but the paper does not clarify whether they were re-evaluated under the same experimental setup, dataset splits, or samplin
* The paper tackles an interesting and practically important problem: how to approximately verify the CoT responses from LLMs.
* The EORM model trained by the authors is indeed lightweight compared to typical ORMs, which has practical implications.
* EORM is trained using an original approach with a special pairwise loss.
* The authors show that EORM generalizes (to a degree) between different base LLMs as well as between datasets.
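The review excerpt above does not spell out the "special pairwise loss", so the sketch below is only an assumption about what such an objective could look like: a Bradley-Terry-style logistic loss that pushes the energy of each correct solution below the energy of each incorrect solution for the same problem. The function name `pairwise_ranking_loss` and the example energies are hypothetical, and the paper's actual loss may differ.

```python
import math


def pairwise_ranking_loss(correct_energies, incorrect_energies):
    """Mean of log(1 + exp(E_correct - E_incorrect)) over all (correct, incorrect) pairs.

    Assumed, illustrative objective: the loss is small when every correct
    solution has lower energy than every incorrect one.
    """
    total, count = 0.0, 0
    for e_pos in correct_energies:
        for e_neg in incorrect_energies:
            margin = e_pos - e_neg
            # Numerically stable softplus: log(1 + exp(margin)).
            total += max(margin, 0.0) + math.log1p(math.exp(-abs(margin)))
            count += 1
    if count == 0:
        raise ValueError("need at least one correct and one incorrect solution")
    return total / count


# Hypothetical energies for one problem's candidate pool, split by outcome label.
well_ranked = pairwise_ranking_loss([-2.0, -1.5], [1.0, 2.0])
badly_ranked = pairwise_ranking_loss([1.0, 2.0], [-2.0, -1.5])
print(well_ranked, badly_ranked)
```

Note that a loss of this shape needs only the binary outcome label of each candidate to form pairs, which is consistent with the reviewers' point that no step-level annotation is required.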
In general I like the idea of training a lightweight ORM, and using a special loss for that. However, a significant weakness of the paper is its poor experimental setup, which makes assessing the performance of EORM difficult.
1. In Table 2, you present a large set of results comparing EORM with other methods / base models. However, these results are not aligned on generation budgets, so comparing them is not meaningful. For instance, it may turn out that if some of the considered base models w
- **Strong empirical results:** State-of-the-art accuracy on GSM8k and MATH, with solid OOD generalization.
- **Efficiency:** 55M-parameter verifier outperforms 7B reward models, highlighting excellent cost–performance tradeoffs.
- **Clarity:** Methodology, loss function, and data preparation are explicitly described, enabling replication.
- **Robust ablations:** Demonstrate architecture sensitivity and tokenizer invariance, strengthening the argument for generalizability.
- **Practical
- **Limited theoretical depth:** The approach is primarily empirical; the energy formulation does not introduce new learning theory.
- **Incremental conceptual novelty:** Extends known EBM and reward-modeling ideas rather than introducing new reasoning paradigms.
- **Evaluation scope:** Focused on math reasoning; broader reasoning domains (commonsense, logical entailment) are not tested.
- **No human interpretability study:** While effective, it is unclear what the model learns as a “signa
+ The proposed method is evaluated against a large collection of fine-tuned models (Mistral, Llama2, DeepSeekMath, Llama 3, and Qwen 2.5) with parameters ranging around 7B.
+ The work shows how to create a relatively small and powerful reward model that beats majority voting and fine-tuned models
Presentation:
- Figure 1: Using an incorrect solution (70) to the "daps baps" problem as an example of something that an EORM thinks is correct and marking it in green can be misleading. Note that the correct answer is 40, as 4 * 10 daps = 7*10 yaps and 5*14 yaps = 70 yaps = 3*14 baps = 42 baps
- line 188-189 "where Zθ is the partition function, a normalization constant that ensures pθ (y) sums to unity:" -> where Zθ is a normalization constant that ensures pθ (y) sums to unity:
- 264-265 "For
Taxonomy
Topics: Complex Systems and Decision Making
Methods: LLaMA · High-Order Consensuses
