Uncovering the Impact of Chain-of-Thought Reasoning for Direct Preference Optimization: Lessons from Text-to-SQL

Hanbing Liu; Haoyang Li; Xiaokang Zhang; Ruotong Chen; Haiyong Xu; Tian Tian; Qi Qi; Jing Zhang

arXiv:2502.11656·cs.CL·February 18, 2025

TL;DR

This paper demonstrates that augmenting Text-to-SQL datasets with synthetic Chain-of-Thought reasoning significantly enhances the effectiveness of Direct Preference Optimization, revealing the importance of reasoning explanations for complex NLP tasks.

Contribution

The study shows that adding synthetic CoT solutions to Text-to-SQL datasets enables DPO to improve performance, highlighting the critical role of reasoning in preference-based training.

Findings

01. CoT augmentation leads to consistent performance gains with DPO

02. CoT reasoning mitigates reward hacking in DPO

03. CoT reasoning improves scalability and strengthens the model's discriminative ability

Abstract

Direct Preference Optimization (DPO) has proven effective in complex reasoning tasks like math word problems and code generation. However, when applied to Text-to-SQL datasets, it often fails to improve performance and can even degrade it. Our investigation reveals the root cause: unlike math and code tasks, which naturally integrate Chain-of-Thought (CoT) reasoning with DPO, Text-to-SQL datasets typically include only final answers (gold SQL queries) without detailed CoT solutions. By augmenting Text-to-SQL datasets with synthetic CoT solutions, we achieve, for the first time, consistent and significant performance improvements using DPO. Our analysis shows that CoT reasoning is crucial for unlocking DPO's potential, as it mitigates reward hacking, strengthens discriminative capabilities, and improves scalability. These findings offer valuable insights for building more robust…
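To make the setup in the abstract concrete, the following is a minimal sketch of the standard per-pair DPO loss, together with a hypothetical helper showing the difference between an answer-only response and a CoT-augmented one. It is not taken from the paper's repository: the `format_response` template, the `beta=0.1` default, and the log-probability values in the usage lines are assumptions chosen for illustration only.

```python
import math


def format_response(sql, cot=None):
    """Hypothetical response template (an assumption, not the paper's format).

    Without a CoT, the response is only the final SQL query; with a CoT,
    the reasoning text precedes the final query.
    """
    if cot is None:
        return sql
    return f"{cot}\nFinal SQL: {sql}"


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each argument is the summed token log-probability of a full
    response under the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative values only: a policy that already prefers the chosen response
# gets a lower loss than one that is indifferent between the two.
indifferent = dpo_loss(-12.0, -12.0, -12.0, -12.0)
prefers_chosen = dpo_loss(-10.0, -14.0, -12.0, -12.0)
chosen_text = format_response("SELECT name FROM users",
                              cot="The question asks for all user names.")
```

In this sketch the only change between the two training regimes is the text of the chosen and rejected responses: with CoT augmentation, the log-probabilities passed to `dpo_loss` cover the reasoning tokens as well as the final query.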

Code & Models

Repositories

RUCKBReasoning/DPO_Text2SQL (Official)

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Direct Preference Optimization