Learning to Reason via Self-Iterative Process Feedback for Small Language Models
Kaiyuan Chen, Jin Wang, Xuejie Zhang

TL;DR
This paper introduces a self-iterative feedback method that lets small language models improve their reasoning from positive and negative signals they generate themselves, rather than from costly external supervision. It reports gains on GSM8K and MBPP and better out-of-domain generalization.
Contribution
It presents a novel self-feedback training approach combining ORPO and process supervision, enhancing reasoning abilities of small language models without costly external signals.
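One common way to get process-level signals without human labels is to score a partial solution by sampling several completions from it and measuring how often they reach the correct final answer. The sketch below illustrates that idea only; the paper's own procedure may differ. The names `step_score`, `sample_completion`, `is_correct`, and `num_rollouts` are hypothetical and do not come from the paper.

```python
def step_score(prefix_steps, sample_completion, is_correct, num_rollouts=8):
    """Estimate the quality of a partial solution by rollouts.

    prefix_steps      -- list of reasoning steps produced so far
    sample_completion -- hypothetical callable: prefix -> final answer string
                         (in practice this would call the model itself)
    is_correct        -- hypothetical callable: answer -> bool
    num_rollouts      -- number of sampled completions (illustrative default)
    """
    hits = 0
    for _ in range(num_rollouts):
        final_answer = sample_completion(prefix_steps)
        if is_correct(final_answer):
            hits += 1
    # Fraction of rollouts that ended at the correct answer.
    return hits / num_rollouts


# Toy usage with a stand-in sampler that always answers "42".
score = step_score(
    ["6 * 7 = 42"],
    sample_completion=lambda prefix: "42",
    is_correct=lambda answer: answer == "42",
)
```

A prefix whose rollouts mostly succeed can serve as a positive signal, and one whose rollouts mostly fail as a negative signal, for preference alignment.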
Findings
Improves Gemma-2B accuracy on GSM8K by 12.43 over supervised fine-tuning (SFT)
Enhances Pass@1 by 3.95 on MBPP
Shows better out-of-domain generalization on MMLU_Math and HumanEval
Abstract
Small language models (SLMs) are more efficient, cost-effective, and customizable than large language models (LLMs), though they often underperform in specific areas like reasoning. Past methods for enhancing SLMs' reasoning, such as supervised fine-tuning and distillation, often depend on costly external signals, resulting in SLMs being overly confident with limited supervision signals, thus limiting their abilities. Therefore, this study enables SLMs to learn to reason from self-iterative feedback. By combining odds ratio preference optimization (ORPO), we fine-tune and align SLMs using positive and negative signals generated by themselves. Additionally, we introduce process supervision for rewards in preference alignment by sampling-based inference simulation and process reward models. Compared to Supervised Fine-Tuning (SFT), our method improves the performance of Gemma-2B by 12.43…
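The ORPO objective mentioned in the abstract can be sketched for a single preference pair as follows. This follows the standard ORPO formulation (a supervised term on the preferred response plus a weighted odds-ratio term). It is not the authors' training code, and the function names and the default weight `lam=0.1` are illustrative assumptions.

```python
import math


def avg_log_prob(token_log_probs):
    # Length-normalized log-likelihood of a response.
    return sum(token_log_probs) / len(token_log_probs)


def log_odds(avg_logp):
    # odds(y|x) = p / (1 - p), with p = exp(avg_logp).
    # Assumes avg_logp < 0, so that p < 1.
    p = math.exp(avg_logp)
    return avg_logp - math.log1p(-p)


def orpo_loss(chosen_token_logps, rejected_token_logps, lam=0.1):
    """ORPO loss for one (chosen, rejected) pair of responses.

    chosen_token_logps   -- per-token log-probs of the preferred response
    rejected_token_logps -- per-token log-probs of the dispreferred response
    lam                  -- weight of the odds-ratio term (illustrative value)
    """
    logp_w = avg_log_prob(chosen_token_logps)
    logp_l = avg_log_prob(rejected_token_logps)
    log_odds_ratio = log_odds(logp_w) - log_odds(logp_l)
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    l_or = math.log1p(math.exp(-log_odds_ratio))
    # Supervised (negative log-likelihood) term on the preferred response.
    l_sft = -logp_w
    return l_sft + lam * l_or


good_pair = orpo_loss([-0.1, -0.2], [-1.0, -2.0])
swapped_pair = orpo_loss([-1.0, -2.0], [-0.1, -0.2])
```

The loss is lower when the model already assigns higher likelihood to the preferred response, so `good_pair` is smaller than `swapped_pair`. In the self-feedback setting described above, both responses in each pair would be generated by the model itself.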
Taxonomy
Topics: Semantic Web and Ontologies · Topic Modeling
Methods: ALIGN
