TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

Weibin Liao; Xu Chu; Yasha Wang

arXiv:2410.12854·cs.CL·October 27, 2025·2 cites


Open Access · 3 Reviews

TL;DR

This paper introduces Tree Preference Optimization (TPO), a method that aligns large language models by learning from entire preference trees, improving learning efficiency and reasoning performance over binary preference methods such as DPO.

Contribution

TPO directly learns from entire preference trees during fine-tuning, formulating alignment as a preference list ranking problem and incorporating adaptive step rewards for better long-chain reasoning.
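
To make the contrast with binary preference learning concrete, here is a minimal sketch, not the paper's implementation. It assumes a DPO-style implicit reward and a LambdaRank-style pair weight (gain difference times rank-discount difference); the function names, the `beta` default, and the integer preference labels are illustrative assumptions, and the adaptive step reward is not modeled.

```python
import math

def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpo_pair_loss(r_chosen, r_rejected):
    """Binary preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return softplus(-(r_chosen - r_rejected))

def lambda_weight(label_i, label_j, rank_i, rank_j):
    """Assumed LambdaRank-style weight: |gain difference| * |discount difference|."""
    gain = abs(2 ** label_i - 2 ** label_j)
    discount = abs(1 / math.log2(1 + rank_i) - 1 / math.log2(1 + rank_j))
    return gain * discount

def listwise_loss(rewards, labels):
    """Lambda-weighted sum of pairwise losses over every pair with labels[i] > labels[j]."""
    n = len(rewards)
    # Rank responses by current reward (1 = highest reward).
    order = sorted(range(n), key=lambda k: -rewards[k])
    rank = {k: pos + 1 for pos, k in enumerate(order)}
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:
                w = lambda_weight(labels[i], labels[j], rank[i], rank[j])
                total += w * dpo_pair_loss(rewards[i], rewards[j])
    return total
```

With three responses labeled `[2, 1, 0]`, `listwise_loss` is lower when the rewards are ordered to match the labels than when they are reversed, whereas `dpo_pair_loss` only ever sees one chosen/rejected pair at a time.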

Findings

1. TPO outperforms DPO on multiple LLMs and datasets.
2. TPO enhances long-chain reasoning capabilities.
3. TPO achieves consistent improvements in mathematical reasoning tasks.

Abstract

In the domain of complex reasoning tasks, such as mathematical reasoning, recent advancements have proposed the use of Direct Preference Optimization (DPO) to suppress the output of dispreferred responses, thereby enhancing the long-chain reasoning capabilities of large language models (LLMs). To this end, these studies employed LLMs to generate preference trees via Tree-of-Thoughts (ToT) and sampled the paired preference responses required by the DPO algorithm. However, the DPO algorithm, being based on binary preference optimization, is unable to learn from multiple responses with varying degrees of preference/dispreference that are provided by the preference trees, resulting in incomplete preference learning. In this work, we introduce Tree Preference Optimization (TPO), which does not sample paired preference responses from the preference tree; instead, it directly learns from the entire preference tree…
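
As a rough illustration of the difference between sampling one pair and keeping the whole tree, the sketch below flattens a toy preference tree into root-to-leaf paths. The `Node` class, the leaf scores, and the helper names are hypothetical; the paper's actual tree construction and scoring are not described in this excerpt.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    step: str                      # text of one reasoning step
    score: float = 0.0             # hypothetical preference score, read only at leaves
    children: list = field(default_factory=list)

def paths(node, prefix=()):
    """Yield (steps, score) for every root-to-leaf path in the tree."""
    steps = prefix + (node.step,)
    if not node.children:
        yield steps, node.score
        return
    for child in node.children:
        yield from paths(child, steps)

def ranked_list(root):
    """All complete responses, most preferred first (the list-wise view)."""
    return sorted(paths(root), key=lambda p: p[1], reverse=True)

def best_worst_pair(root):
    """A single (chosen, rejected) pair, as a binary method would consume."""
    ranked = ranked_list(root)
    return ranked[0], ranked[-1]

# Toy tree: two branches after the problem statement, three complete responses.
tree = Node("problem", children=[
    Node("step A", children=[Node("answer A1", 0.9), Node("answer A2", 0.4)]),
    Node("step B", children=[Node("answer B1", 0.1)]),
])
```

Here `best_worst_pair(tree)` discards the middle response `answer A2`, while `ranked_list(tree)` keeps all three along with their relative order; responses under `step A` also share a common prefix, which is the multi-step aspect of the tree.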

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. This paper is well motivated and well presented.
2. The connection between LLM alignment and learning-to-rank problems is well explained.
3. The proposed implicit reward model formulation in Eq. (10) is novel.

Weaknesses

1. Regarding preference list ranking, there may already be studies on direct optimization approaches [1-3]. The authors could discuss the unique challenges they face or how their work differs from those studies. Ideally, the authors would adapt those ideas as baselines for a more comprehensive comparison.
2. The backbone general LLMs are restricted to only one type, Qwen. The authors might consider experimenting on more types of LLMs, including Llama-3, Phi-2, Mistral, etc.
3.

Reviewer 02 · Rating 5 · Confidence 5

Strengths

1. The paper is well-written and easy to follow.
2. The paper proposes using the classical Lambda weight from Learning-to-Rank (LTR) to construct the TPO loss, which enhances the accuracy of list ranking. This is an interesting design.
3. The design of a loss function for multi-branch and multi-step scenarios is a highly significant research direction. The paper makes several attempts in this direction.

Weaknesses

1. The most critical issue is that I believe obtaining noise-free, list-wise preference data is very costly. A more essential challenge is how to reliably construct such data. The construction method in the paper likely introduces substantial noise, and I am skeptical about the positive impact of aligning the LLM to a noisy list ranking.
2. The proposed Adaptive Step Reward mechanism is overly intuitive, lacking more rigorous experimental analysis and theoretical support.
3. The experimental des

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The paper is well-written and presents comprehensive empirical results with publicly available implementation code, demonstrating the effectiveness of the proposed approach.
- This paper presents a novel approach by decomposing Tree of Thoughts into a hierarchical structure of multi-branch, multi-step responses and leveraging Learning-to-Rank algorithms for preference optimization.

Weaknesses

- The rationale for selecting LambdaLoss over other LTR loss functions in Section 3.1 requires further clarification. Maybe you can justify this choice by highlighting the specific advantages of LambdaLoss in your application.
- In Equation 10, I notice that RM=0 in Line 298, while my understanding suggests that cosine similarity should be 1 when the content is shared. Could you please clarify this point? Maybe you can provide more explanation of your *Adaptive Step Reward*.
- The ablation s

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies

Methods: Direct Preference Optimization