Advancing LLM Reasoning Generalists with Preference Trees
Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun

TL;DR
This paper introduces Eurus, a new suite of open-source LLMs optimized for reasoning, achieving state-of-the-art results through specialized fine-tuning and a novel high-quality alignment dataset called UltraInteract.
Contribution
The paper presents the Eurus models, the UltraInteract dataset, and a new reward modeling approach, advancing the reasoning capabilities of open-source LLMs beyond existing methods.
Findings
Eurus-70B surpasses GPT-3.5 Turbo on reasoning benchmarks.
UltraInteract improves reasoning performance through preference learning.
Eurus models achieve high accuracy on LeetCode and TheoremQA.
Abstract
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 tests covering five tasks, and achieves a 33.3% pass@1 accuracy on LeetCode and 32.6% on TheoremQA, two challenging benchmarks, substantially outperforming existing open-source models by margins more than 13.3%. The strong performance of Eurus can be primarily attributed to UltraInteract, our newly-curated large-scale, high-quality alignment dataset specifically designed for complex reasoning tasks. UltraInteract can be used in both supervised fine-tuning and preference learning. For each…
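The pass@1 figures quoted in the abstract can be read with the help of a short sketch. The code below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for a problem with n sampled solutions of which c pass the tests. The abstract does not state how many samples were drawn per problem, so the function names and the toy counts are illustrative assumptions, not the authors' evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy data: three problems, one sample each, one solved.
toy_results = [(1, 1), (1, 0), (1, 0)]
toy_score = benchmark_pass_at_k(toy_results, k=1)
```

With a single sample per problem, pass@1 reduces to the fraction of problems solved, which is how a score such as 33.3% on LeetCode would typically be interpreted.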
Peer Reviews
Decision · ICLR 2025 Poster
1. With regards to soundness, I feel that the necessary experiments have been run to validate the majority of claims, especially where those claims are with regards to methodological contributions. The authors have also taken pains to remove contaminated data from their work in order to make comparisons fair and meaningful, including when reporting others' work. 2. The presented language models have strong performance, and the data and reward models are in and of themselves useful contributions
1. As a minor point the spelling and grammar could be improved; for instance "Is proprietary models" (line 470) should be "Are proprietary models", and more generally things like "Perference Learning" (line 247). More substantially some of the references point to the wrong sections (e.g. the reference to section 5 (replaced with 6) (line 255) -- in this case harming readability (hence my review of the presentation...) 2. I feel that the modification to the reward model could be better motivated
The authors use a new method to synthesize a dataset for SFT and preference learning, which could potentially enhance the model's reasoning abilities. The intuition behind the synthesis method is straightforward and easy to understand. I think the dataset is cool and could be a way for a model to learn how to improve its responses. Plus, the insights on preference learning algorithms are interesting.
1). I agree that providing trajectories to guide model improvements is a potential approach. However, during the training process, I believe that the vertical improvement information, sequential refinement across turns, may not be effectively learned. This is because current preference algorithms primarily focus on horizontal comparisons, assessing responses within the same turn. 2). The reasons behind the better performance of Eurus are hard to track and some studies will be necessary if auth
- The paper is advancing open science by making the training data and model checkpoints public. Given the significant improvements in reasoning tasks, it is likely that these assets will be helpful to other researchers. - The paper also proposes a new way of training reward models that is better suited to reasoning tasks. In addition, the training datasets have multi-step attempts that contain mistakes and tool usage, which is unlike other preference datasets. - The experimental section is detai
- The heavy reliance on GPT responses makes me feel like this is more like distilling GPT. Also, it is not clear what usage limitations will arise from using a proprietary model like GPT-4. As shown in Table 7, this was crucial for obtaining good performance. - The problem of the likelihood of chosen responses going down in reasoning is a known issue studied in prior work [1], which is not cited in the paper (the related work is quite short) - The term “multi-turn action” was confusing.
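One review above contrasts "horizontal" comparisons (responses within the same turn) with "vertical" improvement (sequential refinement across turns). A minimal sketch of that distinction follows. The `Response` type, both function names, and the toy data are hypothetical and do not reflect the actual UltraInteract format or the authors' training code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Response:
    text: str
    correct: bool  # hypothetical label: whether this response solved the task


def horizontal_pairs(turns: list[list[Response]]) -> list[tuple[str, str]]:
    """(chosen, rejected) pairs formed only among responses of the same turn."""
    pairs: list[tuple[str, str]] = []
    for responses in turns:
        good = [r.text for r in responses if r.correct]
        bad = [r.text for r in responses if not r.correct]
        pairs.extend(product(good, bad))
    return pairs


def vertical_pairs(attempts: list[str]) -> list[tuple[str, str]]:
    """(later, earlier) pairs along one refinement chain.

    Assumes, for illustration only, that each later attempt improves on
    the previous one.
    """
    return [(attempts[i + 1], attempts[i]) for i in range(len(attempts) - 1)]


turns = [
    [Response("turn1: wrong", False), Response("turn1: right", True)],
    [Response("turn2: wrong", False), Response("turn2: right", True)],
]
chain = ["attempt 1", "attempt 2", "attempt 3"]
```

In this toy setup, `horizontal_pairs(turns)` never compares responses from different turns, which is the limitation the reviewer points to. `vertical_pairs(chain)` shows what an across-turn signal could look like if it were constructed explicitly.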
Code & Models
- 🤗 openbmb/Eurus-7b-sft · model · 1.4k dl · ♡ 18
- 🤗 openbmb/Eurus-7b-kto · model · 39 dl · ♡ 13
- 🤗 openbmb/Eurus-70b-sft · model · 86 dl · ♡ 5
- 🤗 openbmb/Eurus-70b-nca · model · 94 dl · ♡ 12
- 🤗 openbmb/Eurus-RM-7b · model · 270 dl · ♡ 28
- 🤗 openbmb/Eurux-8x22b-nca · model · 28 dl · ♡ 28
- 🤗 openbmb/Eurux-8x22b-kto · model · 23 dl · ♡ 8
- 🤗 RichardErkhov/openbmb_-_Eurus-7b-sft-gguf · model · 52 dl
- 🤗 RichardErkhov/openbmb_-_Eurus-7b-kto-gguf · model · 36 dl
- 🤗 RichardErkhov/openbmb_-_Eurus-70b-sft-gguf · model · 9 dl
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Rough Sets and Fuzzy Logic
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Dropout · Layer Normalization · Multi-Head Attention · Weight Decay
