Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization
Amir Saeidi, Shivanshu Verma, Aswin RRV, Kashif Rasul, Chitta Baral

TL;DR
This paper introduces Triple Preference Optimization (TPO), a novel one-step preference learning method that enhances reasoning and instruction-following in large language models, outperforming existing methods like DPO with less data.
Contribution
The paper proposes TPO, a new preference optimization technique that overcomes limitations of DPO, improving LLM alignment in reasoning and instruction-following tasks with a single-step approach.
Findings
TPO outperforms DPO and SimPO on multiple benchmarks.
TPO achieves up to 19.2% improvement on GSM8K.
TPO requires less data than DPO for comparable performance.
Abstract
Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome these shortcomings. While studies have shown that DPO improves instruction-following capabilities, it negatively impacts the reasoning ability of LLMs. Additionally, DPO is highly sensitive to judgment noise in preference datasets and the size of the training set. Although several modifications to DPO have been proposed, they still fail to fully resolve these issues. To address these limitations, we propose Triple Preference Optimization (TPO), a new preference learning method designed to enhance both reasoning and instruction-following abilities through one-step optimization. We compare TPO against DPO and its recent variants using state-of-the-art…
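For context on the baseline named in the abstract, the standard DPO objective for a single preference pair can be sketched as follows. This is a minimal illustration of the published DPO loss, not the TPO objective, which this page does not spell out. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for the example. Inputs are summed log-probabilities of whole responses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the log-probability of a full response under either
    the policy being trained or the frozen reference model.
    beta=0.1 is an illustrative default, not a value taken from the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the two implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response relative to the rejected one. The reference model enters through both log-ratios, which is the reference dependence of DPO that the later variants compared in the abstract modify.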
Code & Models
- 🤗 tpo-alignment/Llama-3-8B-TPO-40k (model)
- 🤗 tpo-alignment/Llama-3-8B-TPO-L-40k (model, 2 downloads)
- 🤗 tpo-alignment/Instruct-Llama-3-8B-TPO-y2 (model, 2 downloads)
- 🤗 tpo-alignment/Instruct-Llama-3-8B-TPO-y3 (model, 2 downloads)
- 🤗 tpo-alignment/Instruct-Llama-3-8B-TPO-y4 (model, 3 downloads)
- 🤗 tpo-alignment/Instruct-Llama-3-8B-TPO-L-y2 (model, 3 downloads)
- 🤗 tpo-alignment/Mistral-7B-TPO-40k (model, 4 downloads)
- 🤗 tpo-alignment/Mistral-Instruct-7B-TPO-y2-v0.1 (model, 3 downloads)
- 🤗 tpo-alignment/Mistral-Instruct-7B-TPO-y2-v0.2 (model, 1 download)
- 🤗 tpo-alignment/Mistral-Instruct-7B-TPO-y3 (model)
Taxonomy
Topics: Advanced Multi-Objective Optimization Algorithms
Methods: Direct Preference Optimization · ALIGN · Shrink and Fine-Tune
