TODO: Enhancing LLM Alignment with Ternary Preferences

Yuxiang Guo; Lu Yin; Bo Jiang; Jiaqi Zhang

arXiv:2411.02442·cs.CL·April 1, 2025

TL;DR

This paper introduces TOBT, an extension of the Bradley-Terry model that explicitly models ties, and TODO, an alignment algorithm built on it. Together they improve preference modeling and alignment performance over DPO.

Contribution

The paper presents TOBT and TODO, extending preference models to include ties, which enhances LLM alignment accuracy and robustness compared to traditional binary preference methods.

Findings

01. TODO outperforms DPO on preference modeling tasks
02. Improved alignment on multiple benchmarks, including MT-Bench, PIQA, ARC-c, and MMLU
03. Effective in both binary and ternary preference settings

Abstract

Aligning large language models (LLMs) with human intent is critical for enhancing their performance across a variety of tasks. Standard alignment techniques, such as Direct Preference Optimization (DPO), often rely on the binary Bradley-Terry (BT) model, which can struggle to capture the complexities of human preferences -- particularly in the presence of noisy or inconsistent labels and frequent ties. To address these limitations, we introduce the Tie-rank Oriented Bradley-Terry model (TOBT), an extension of the BT model that explicitly incorporates ties, enabling more nuanced preference representation. Building on this, we propose Tie-rank Oriented Direct Preference Optimization (TODO), a novel alignment algorithm that leverages TOBT's ternary ranking system to improve preference alignment. In evaluations on Mistral-7B and Llama 3-8B models, TODO consistently outperforms DPO in…
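To make the idea of a tie-aware Bradley-Terry model concrete, here is a minimal sketch in Python. It assumes a Rao-Kupper-style parameterization, a classical way of adding ties to BT with a single non-negative tie parameter; the paper's exact TOBT definition may differ. The function name `tie_aware_probs` and the example values of `alpha` are illustrative assumptions, not taken from the paper or its repository.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def tie_aware_probs(r1: float, r2: float, alpha: float) -> tuple[float, float, float]:
    """Return (P(y1 preferred), P(y2 preferred), P(tie)).

    Assumption: Rao-Kupper-style ternary model, where alpha >= 0 widens
    the margin a response needs in order to be strictly preferred.
    With alpha == 0 this reduces to the binary Bradley-Terry model.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    diff = r1 - r2
    p_first = sigmoid(diff - alpha)    # y1 strictly preferred
    p_second = sigmoid(-diff - alpha)  # y2 strictly preferred
    # The remaining probability mass goes to the tie outcome.
    p_tie = (math.exp(2.0 * alpha) - 1.0) * p_first * p_second
    return p_first, p_second, p_tie


if __name__ == "__main__":
    # Hypothetical reward values, for illustration only.
    for alpha in (0.0, 0.5):
        print(alpha, tie_aware_probs(1.2, 0.3, alpha))
```

In this form the three probabilities always sum to one, and the tie probability is largest when the two rewards are equal, which matches the intuition that near-equal responses are the ones most likely to be labeled as ties.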

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 5

Strengths

1. The paper introduces a novel extension to the Bradley-Terry model by incorporating ties through the Tie-rank Oriented Bradley-Terry (TOBT) model. This allows for more nuanced preference modeling in LLM alignment.
2. The theoretical foundations of the TOBT model and the TODO algorithm are well-developed. The authors provide detailed derivations and clarify how the tie parameter α is integrated into the model.
3. The paper is generally well-written and structured. The methodology, includi

Weaknesses

1. The extension from the traditional Bradley-Terry (BT) model to the Tie-rank Oriented BT (TOBT) model is central to the proposed method. However, the paper provides limited theoretical justification for this extension. The choice of the parameter α in the TOBT model appears somewhat arbitrary, and while Appendix A.3 attempts to justify its value based on balancing loss values, there is no rigorous analysis or sensitivity study demonstrating how different values of α affect the model's performa

Reviewer 02 · Rating 5 · Confidence 3

Strengths

- The paper is well written, and it is very easy to understand its core motivations and follow how the entire derivation of the tie-aware BT model and the corresponding TODO model unfolds from the main design assumptions.
- The provided results offer reasonable evidence on the benefits of TODO over DPO. The chosen evaluation data and models are adequate.
- The paper raises awareness on modeling ties in preference optimizations, where such valuable data often gets discarded under inadequate model

Weaknesses

- One of the core issues with the work lies in the fact that other researchers in 'less recent times' have already attempted and derived generalised versions of the BT model, which better align with multi-class (i.e., non-binary) problems. Even a cursory search online reveals some very relevant literature which is not cited nor discussed in this work. I would require the authors to provide a thorough overview of that relevant work and discuss why a new (tie-aware) derivation might be needed here

Reviewer 03 · Rating 8 · Confidence 3

Strengths

1. The idea is well motivated, as ties are relevant especially in human preference elicitation and can be expected to become even more prevalent the stronger generative models get.
2. The experiments strongly support the new objective, and the results are convincing.
3. The proposed solution is based on a solid intuition and supported by detailed derivation.

Weaknesses

The only claim that lacks empirical support is that TODO “exhibits better robustness against potential noise in binary preference data.” There is no experiment studying noise in preferences (ties need not be caused by noise; they can also reflect genuine equivalence between responses). One could easily set up an experiment with artificial noise added to the preferences to test this hypothesis.
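The noise experiment the reviewer suggests could start from something like the following sketch, which swaps the chosen and rejected responses in a fraction of binary preference pairs. The helper name `flip_preferences`, the `(chosen, rejected)` tuple layout, and the `noise_rate` parameter are hypothetical choices for illustration; nothing here comes from the paper or its repository.

```python
import random


def flip_preferences(pairs, noise_rate, seed=0):
    """Return a copy of `pairs` with artificial label noise.

    `pairs` is assumed to be a list of (chosen, rejected) tuples.
    Each pair is swapped independently with probability `noise_rate`.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible noise
    noisy = []
    for chosen, rejected in pairs:
        if rng.random() < noise_rate:
            noisy.append((rejected, chosen))  # flipped label
        else:
            noisy.append((chosen, rejected))  # label kept
    return noisy


if __name__ == "__main__":
    data = [("good answer", "bad answer"), ("helpful", "unhelpful")]
    print(flip_preferences(data, noise_rate=0.5, seed=1))
```

Training both DPO and TODO on data corrupted at several noise rates, then comparing how quickly each degrades, would be one way to test the robustness claim directly.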

Code & Models

Repositories

xxares/todo
PyTorch · Official

Taxonomy

Topics: Digital Rights Management and Security · Business Process Modeling and Analysis

Methods: LLaMA · Direct Preference Optimization