Thinking Preference Optimization
Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, Xiaotian Han

TL;DR
Thinking Preference Optimization (ThinkPO) is a post-SFT method that improves long chain-of-thought (CoT) reasoning in language models. It uses preference optimization to make the model favor longer reasoning outputs, and it requires no new long CoT responses.
Contribution
The paper introduces ThinkPO, a simple method that improves the reasoning performance of SFT models by reusing existing long CoT responses as chosen answers and short CoT responses as rejected answers for preference optimization.
Findings
Increases the math reasoning accuracy of SFT models by 8.6%.
Increases their output length by 25.9%.
Improves the performance of distilled SFT models on benchmarks.
Abstract
Supervised Fine-Tuning (SFT) has been a go-to and effective method for enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by fine-tuning them with long CoT responses from larger LLMs. To continually improve reasoning abilities, we can either collect new high-quality long CoT reasoning SFT data or repeatedly train on existing SFT datasets. However, acquiring new long CoT SFT data is costly and limited, while repeated training often results in a performance plateau or decline. To further boost the performance with the SFT data, we propose Thinking Preference Optimization (ThinkPO), a simple yet effective post-SFT method that enhances long CoT reasoning without requiring new long CoT responses. Instead, ThinkPO utilizes readily available or easily obtainable short CoT reasoning responses as rejected answers and long CoT responses as chosen answers for the same…
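The pairing described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `PreferencePair`, `build_pairs`, and `dpo_loss` are hypothetical, and the loss shown is the standard Direct Preference Optimization (DPO) objective, assumed here as one common form of preference optimization. The value `beta=0.1` is likewise an illustrative default, not a setting taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # long CoT response (preferred)
    rejected: str  # short CoT response (dispreferred)


def build_pairs(questions, long_cot, short_cot):
    """Pair each question's long CoT (chosen) with its short CoT (rejected).

    long_cot and short_cot are dicts mapping a question to a response.
    Questions missing from either dict are skipped.
    """
    pairs = []
    for q in questions:
        if q in long_cot and q in short_cot:
            pairs.append(PreferencePair(q, long_cot[q], short_cot[q]))
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair (assumed objective, see lead-in).

    Inputs are summed sequence log-probabilities under the policy being
    trained and under a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) computed in a numerically stable way.
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Toy data: one question with both a long and a short response.
questions = ["What is 12 * 13?"]
long_cot = {questions[0]: "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156. Check: 13 * 12 = 156."}
short_cot = {questions[0]: "156."}
pairs = build_pairs(questions, long_cot, short_cot)
```

Under this objective the loss falls as the policy, relative to the reference model, assigns more probability to the long response than to the short one, which is the mechanism by which training would push the model toward longer reasoning.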
Taxonomy
Topics: Bayesian Modeling and Causal Inference · Topic Modeling · Advanced Graph Neural Networks
Methods: Shrink and Fine-Tune
