Uncovering the Impact of Chain-of-Thought Reasoning for Direct Preference Optimization: Lessons from Text-to-SQL
Hanbing Liu, Haoyang Li, Xiaokang Zhang, Ruotong Chen, Haiyong Xu, Tian Tian, Qi Qi, Jing Zhang

TL;DR
This paper shows that augmenting Text-to-SQL datasets with synthetic Chain-of-Thought (CoT) solutions lets Direct Preference Optimization (DPO) deliver consistent, significant gains on a task where it otherwise often fails to help or even degrades performance.
Contribution
The study shows that adding synthetic CoT solutions to Text-to-SQL datasets enables DPO to improve performance, highlighting the critical role of reasoning in preference-based training.
Findings
CoT augmentation leads to consistent performance gains
CoT reasoning mitigates reward hacking in DPO
CoT reasoning improves DPO's scalability and strengthens discriminative ability
Abstract
Direct Preference Optimization (DPO) has proven effective in complex reasoning tasks like math word problems and code generation. However, when applied to Text-to-SQL datasets, it often fails to improve performance and can even degrade it. Our investigation reveals the root cause: unlike math and code tasks, which naturally integrate Chain-of-Thought (CoT) reasoning with DPO, Text-to-SQL datasets typically include only final answers (gold SQL queries) without detailed CoT solutions. By augmenting Text-to-SQL datasets with synthetic CoT solutions, we achieve, for the first time, consistent and significant performance improvements using DPO. Our analysis shows that CoT reasoning is crucial for unlocking DPO's potential, as it mitigates reward hacking, strengthens discriminative capabilities, and improves scalability. These findings offer valuable insights for building more robust…
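The preference objective discussed in the abstract can be illustrated with the standard DPO loss for a single (chosen, rejected) pair. This is a minimal sketch of the generally known DPO formulation, not the authors' training code. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration. Each log-probability is assumed to be the sum over all tokens of a response, so when responses contain CoT reasoning followed by SQL, the reasoning tokens contribute to the loss as well.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is a sequence-level log-probability (sum of token
    log-probs) of a response under the policy or the frozen reference
    model. `beta` is a hypothetical default, not a value from the paper.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled implicit-reward margin between chosen and rejected.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2. The loss falls as the policy assigns relatively more probability to the chosen response than to the rejected one.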
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Direct Preference Optimization
