DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models
Chengcheng Han, Xiaowei Du, Che Zhang, Yixin Lian, Xiang Li, Ming Gao, Baoyuan Wang

TL;DR
This paper introduces DialCoT, a dialogue-based reasoning method combined with PPO optimization, to improve the reasoning abilities of Smaller Language Models on complex arithmetic tasks, outperforming previous approaches.
Contribution
We propose DialCoT with dialogue-guided reasoning and PPO-based path optimization, enabling smaller models to effectively perform complex reasoning tasks.
Findings
Significant performance improvements on four arithmetic reasoning datasets.
DialCoT reduces task difficulty by breaking down questions into sub-questions.
PPO optimization enhances reasoning path selection and accuracy.
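To make the decomposition idea concrete, here is a minimal sketch of how a question and its sub-questions might be laid out as a multi-turn dialogue. The function name `build_dialogue`, the `User:`/`Assistant:` role labels, and the sample question are all assumptions made for illustration; this summary does not specify the paper's actual prompt format.

```python
from typing import List, Tuple


def build_dialogue(question: str, steps: List[Tuple[str, str]]) -> str:
    """Render a question and its (sub-question, sub-answer) pairs as dialogue turns.

    Hypothetical helper: the role labels are illustrative, not the paper's format.
    """
    lines = [f"User: {question}"]
    for sub_question, sub_answer in steps:
        # Each intermediate reasoning step becomes one question/answer exchange.
        lines.append(f"User: {sub_question}")
        lines.append(f"Assistant: {sub_answer}")
    return "\n".join(lines)


# Made-up arithmetic question, split into two simpler sub-questions.
example = build_dialogue(
    "Tom has 3 boxes with 4 apples each and eats 2 apples. How many are left?",
    [
        ("How many apples does Tom start with?", "3 * 4 = 12"),
        ("How many apples remain after he eats 2?", "12 - 2 = 10"),
    ],
)
print(example)
```

Each sub-question is a single arithmetic step, which is the sense in which decomposition lowers the difficulty of any one turn.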
Abstract
Chain-of-Thought (CoT) prompting has proven to be effective in enhancing the reasoning capabilities of Large Language Models (LLMs) with at least 100 billion parameters. However, it is ineffective or even detrimental when applied to reasoning tasks in Smaller Language Models (SLMs) with less than 10 billion parameters. To address this limitation, we introduce Dialogue-guided Chain-of-Thought (DialCoT) which employs a dialogue format to generate intermediate reasoning steps, guiding the model toward the final answer. Additionally, we optimize the model's reasoning path selection using the Proximal Policy Optimization (PPO) algorithm, further enhancing its reasoning capabilities. Our method offers several advantages compared to previous approaches. Firstly, we transform the process of solving complex reasoning questions by breaking them down into a series of simpler sub-questions,…
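For readers unfamiliar with PPO, the sketch below computes the standard clipped surrogate objective from the general PPO formulation. It is not the paper's training code: how DialCoT defines actions, rewards, and advantages is not stated in this summary, so the inputs here are plain lists of log-probabilities and advantage estimates, and the name `ppo_clipped_objective` and the default `clip_eps=0.2` are assumptions.

```python
import math
from typing import Sequence


def ppo_clipped_objective(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped surrogate objective of generic PPO (to be maximized).

    new_logp / old_logp: log-probabilities of the sampled actions under the
    current and the behavior policy. advantages: advantage estimates.
    """
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio
        # outside [1 - clip_eps, 1 + clip_eps].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# If the policy has not changed, the ratio is 1 and the objective
# equals the mean advantage.
unchanged = ppo_clipped_objective([0.0, 0.0], [0.0, 0.0], [1.0, 3.0])

# A doubled action probability with positive advantage is capped at 1 + clip_eps.
capped = ppo_clipped_objective([math.log(2.0)], [0.0], [1.0])
```

In a path-selection setting, an action could correspond to choosing one candidate reasoning step over another, but that mapping is an assumption here and not something this page confirms.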
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
Methods: Entropy Regularization · Proximal Policy Optimization
