Teaching LLMs to Refine with Tools
Dian Yu, Yuheng Zhang, Jiahao Xu, Tian Liang, Linfeng Song, Zhaopeng Tu, Haitao Mi, Dong Yu

TL;DR
This paper introduces CaP, a novel method that uses external tools and preference optimization to improve the reasoning capabilities of large language models through iterative refinement, surpassing previous methods limited to within-format improvements.
Contribution
CaP is the first approach to combine external tool use with preference optimization for cross-reasoning refinement in LLMs, enhancing their self-improvement capabilities.
Findings
CaP effectively improves cross-reasoning refinement in LLMs.
Preference optimization is crucial for successful refinement.
Sampling strategies influence inference efficiency and quality.
Abstract
Large language models (LLMs) can refine their responses based on feedback, enabling self-improvement through iterative training or test-time refinement. However, existing methods predominantly focus on refinement within the same reasoning format, which may lead to non-correcting behaviors. We propose CaP, a novel approach that uses external tools to refine chain-of-thought (CoT) responses generated by the same or other LLMs. CaP employs a two-stage training process: supervised fine-tuning followed by preference optimization with DPO variants. Our observations highlight the critical role of preference optimization in enabling effective refinement. Additionally, we compare several sampling strategies to leverage CoT and tools at inference time. Experimental results demonstrate CaP's potential for effective cross-reasoning refinement and efficient inference.
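The abstract names preference optimization with DPO variants as the second training stage. As a rough illustration only, the sketch below computes the standard DPO loss for a single preference pair in plain Python. The function name, its arguments, and the default `beta` are assumptions made for this example; the abstract does not say which DPO variant CaP uses or how it is configured, so this should not be read as the authors' implementation.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a full response
    under the trainable policy or the frozen reference model.
    All names and the beta default are hypothetical choices for this sketch.
    """
    # Implicit reward of each response: how much more likely the policy
    # makes it relative to the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)

    # Loss is -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical numbers: the policy favors the refined (chosen) response
# more strongly than the reference model does.
loss = dpo_pair_loss(-12.0, -20.0, -14.0, -18.0, beta=0.1)
print(f"{loss:.4f}")
```

In a refinement setting, the chosen response would plausibly be a tool-assisted correction and the rejected one an uncorrected attempt, but how the pairs are built is not stated in the abstract. The loss falls below log 2 once the policy separates the pair more than the reference does.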
Taxonomy
Topics: Artificial Intelligence in Law · Natural Language Processing Techniques · Library Science and Information Systems
Methods: Direct Preference Optimization · Focus
