Thinking Preference Optimization

Wang Yang; Hongye Jin; Jingfeng Yang; Vipin Chaudhary; Xiaotian Han

arXiv:2502.13173·cs.LG·February 20, 2025

TL;DR

Thinking Preference Optimization (ThinkPO) is a post-SFT training method that improves long chain-of-thought (CoT) reasoning in language models. It uses preference optimization to favor long CoT responses over short ones, and it requires no new long CoT data.

Contribution

The paper introduces ThinkPO, a simple method that improves the reasoning performance of SFT models by reusing existing short and long CoT responses as rejected and chosen answers for preference optimization.
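To make the idea of preference optimization over chosen and rejected responses concrete, here is a minimal sketch of a pairwise preference loss. It assumes a DPO-style objective, which this page does not confirm; the function name `dpo_style_loss`, the `beta` value, and the log-probability inputs are illustrative assumptions, not the authors' implementation.

```python
import math


def dpo_style_loss(policy_chosen_logp, policy_rejected_logp,
                   ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Pairwise preference loss for one (chosen, rejected) pair.

    Assumption: a DPO-style objective. Inputs are summed token
    log-probabilities of each response under the trained policy and
    under a frozen reference model (for example, the SFT checkpoint).
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)), written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# When the policy equals the reference, the loss is log(2).
baseline = dpo_style_loss(-10.0, -10.0, -10.0, -10.0)
# When the policy favors the chosen (long CoT) response, the loss drops.
improved = dpo_style_loss(-5.0, -20.0, -10.0, -10.0)
```

Under this assumed objective, the loss falls as the policy assigns relatively more probability to the chosen long CoT response than to the rejected short one, which is the direction of pressure the contribution describes.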

Findings

01. Increases math reasoning accuracy by 8.6%.

02. Boosts output length by 25.9%.

03. Improves the performance of distilled SFT models on benchmarks.

Abstract

Supervised Fine-Tuning (SFT) has been a go-to and effective method for enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by fine-tuning them with long CoT responses from larger LLMs. To continually improve reasoning abilities, we can either collect new high-quality long CoT reasoning SFT data or repeatedly train on existing SFT datasets. However, acquiring new long CoT SFT data is costly and limited, while repeated training often results in a performance plateau or decline. To further boost the performance with the SFT data, we propose Thinking Preference Optimization (ThinkPO), a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses. Instead, ThinkPO utilizes readily available or easily obtainable short CoT reasoning responses as rejected answers and long CoT responses as chosen answers for the same…
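The pairing described in the abstract, with short CoT responses as rejected answers and long CoT responses as chosen answers, can be sketched as a small data-preparation step. This is a minimal illustration that assumes both response sets are keyed by question text; the `PreferencePair` class, the `build_pairs` helper, and the example question are hypothetical and not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # the question
    chosen: str    # long CoT response (preferred)
    rejected: str  # short CoT response (dispreferred)


def build_pairs(long_cot, short_cot):
    """Pair long and short CoT responses that answer the same question.

    Both arguments are dicts mapping question text to a response.
    Questions that lack either kind of response are skipped.
    """
    pairs = []
    for question, long_answer in long_cot.items():
        short_answer = short_cot.get(question)
        if short_answer is None:
            continue
        pairs.append(PreferencePair(question, long_answer, short_answer))
    return pairs


# Hypothetical toy data, for illustration only.
long_cot = {
    "What is 12 * 13?": (
        "Let me think step by step. 12 * 13 = 12 * 10 + 12 * 3 "
        "= 120 + 36 = 156. Checking: 13 * 12 = 156. The answer is 156."
    ),
}
short_cot = {
    "What is 12 * 13?": "12 * 13 = 156.",
    "What is 7 + 5?": "12.",  # no long CoT counterpart, so it is skipped
}

pairs = build_pairs(long_cot, short_cot)
```

The resulting records use the prompt/chosen/rejected layout that preference-optimization trainers commonly expect, so existing SFT data can be reused without generating new long CoT responses.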

Code & Models

Repositories

uservan/ThinkPO (official)

Taxonomy

Topics: Bayesian Modeling and Causal Inference · Topic Modeling · Advanced Graph Neural Networks

Methods: Shrink and Fine-Tune