Making Large Language Models Better Reasoners with Alignment
Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, and Zhifang Sui

TL;DR
This paper introduces Alignment Fine-Tuning (AFT), a method that improves reasoning in large language models by calibrating the scores they assign to chain-of-thought (COT) responses, addressing the assessment misalignment problem.
Contribution
The paper proposes a novel AFT paradigm that uses a constraint alignment loss to enhance the reasoning capabilities of LLMs, and it analyzes the importance of constraints in ranking-based alignment methods.
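As a rough illustration of what a ranking-based alignment loss with a constraint can look like, the Python sketch below penalizes every pair in which a negative response scores close to or above a positive one. It is a sketch under stated assumptions, not the paper's formulation: the function name `ranking_alignment_loss`, the `floor` parameter, and the use of plain floats as scores are all hypothetical, and the constraint shown (ignoring negatives whose score is already below `floor`) is just one simple choice.

```python
import math
from typing import List, Optional


def ranking_alignment_loss(
    pos_scores: List[float],
    neg_scores: List[float],
    floor: Optional[float] = None,
) -> float:
    """Pairwise ranking loss: log(1 + sum over pairs of exp(s_neg - s_pos)).

    Scores are assumed to be scalar model scores per response (for example,
    an average token log-probability); that choice is an assumption here.
    If `floor` is given (hypothetical constraint), negatives already scoring
    below it are skipped, so the loss stops pushing them further down.
    """
    total = 0.0
    for s_neg in neg_scores:
        if floor is not None and s_neg < floor:
            continue  # constraint: this negative is already low enough
        for s_pos in pos_scores:
            total += math.exp(s_neg - s_pos)
    return math.log1p(total)


# Well-ranked pair (positive above negative) versus a misranked pair.
well_ranked = ranking_alignment_loss([-0.2], [-1.0])
misranked = ranking_alignment_loss([-1.0], [-0.2])
```

In this sketch the floor keeps the loss from rewarding ever-lower scores for negatives, which is one plausible reading of why a constraint might matter in ranking-based methods.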
Findings
AFT significantly improves reasoning performance on four benchmarks.
Constraint alignment loss effectively calibrates LLM scores.
Analysis shows constraints are crucial in ranking-based alignment methods.
Abstract
Reasoning is a cognitive process of using evidence to reach a sound conclusion. The reasoning capability is essential for large language models (LLMs) to serve as the brain of the artificial general intelligence agent. Recent studies reveal that fine-tuning LLMs on data with the chain of thought (COT) reasoning process can significantly enhance their reasoning capabilities. However, we find that the fine-tuned LLMs suffer from an Assessment Misalignment problem, i.e., they frequently assign higher scores to subpar COTs, leading to potential limitations in their reasoning abilities. To address this problem, we introduce an Alignment Fine-Tuning (AFT) paradigm, which involves three steps: 1) fine-tuning LLMs with COT training data; 2) generating multiple COT responses for each question, and categorizing them into positive and negative ones based on whether they achieve…
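Step 2 of the abstract can be sketched in Python as follows, assuming (as the reviews below describe) that positives are responses whose final answer is correct. The helper names `extract_final_answer` and `split_by_correctness`, and the rule that the last number in a response is its final answer, are hypothetical choices for illustration and are not taken from the paper.

```python
import re
from typing import List, Optional, Tuple


def extract_final_answer(cot: str) -> Optional[str]:
    """Return the last number in a chain-of-thought string, or None.

    Treating the last number as the final answer is an assumption that
    suits simple math word problems; other tasks need their own parser.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cot)
    return numbers[-1] if numbers else None


def split_by_correctness(
    responses: List[str], gold_answer: str
) -> Tuple[List[str], List[str]]:
    """Split sampled responses into (positives, negatives) by final answer."""
    positives: List[str] = []
    negatives: List[str] = []
    for cot in responses:
        answer = extract_final_answer(cot)
        if answer is not None and float(answer) == float(gold_answer):
            positives.append(cot)
        else:
            negatives.append(cot)  # wrong or unparseable answers
    return positives, negatives


sampled = [
    "2 + 3 = 5. The answer is 5",
    "2 + 3 = 6. The answer is 6",
    "I am not sure.",
]
positives, negatives = split_by_correctness(sampled, "5")
```

Note that this split looks only at the final answer, so a response with flawed intermediate steps but a correct result would still count as positive.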
Peer Reviews
Decision·Submitted to ICLR 2024
Overall the paper is easy to read and the presentation of the main ideas is clear. The proposed method seems novel and is well-motivated. The empirical results are convincing.
Although the intention is to improve the "reasoning" capability of the model, the additional loss function relies on the slightly risky assumption that generated outputs with the correct final answer should be assigned a higher score than those with the wrong final answer. One could argue that the chain of thought itself is perhaps more important than the final answer, and that some negative examples should still be scored higher than positive examples with "wrong" reasoning steps. Obviously this c
The authors propose a sensible approach to do fine-tuning. The proposed fine-tuning loss including the constraints for negative examples is sufficiently introduced and defined. The method is also easily applicable to other problems, given that negative samples are identified. Also, the authors provide runnable code for the review, backing up the clarity and quality of their work. The evaluation results are promising as well. The approach is mostly better than the chosen baselines, thereby showi
The related work on preference alignment is a tad vague: although it includes a variety of strongly related and relevant works, the focus of the discussion could/should be more on the diverse strategies of the LLMs tuned for mathematical reasoning tasks. Referenced works could thus be better introduced and compared based on their respective losses/techniques. This would make clear how innovative/novel the proposed technique is. There is no clear argumentation why other mathematical datasets
[+] The paper identified an important problem that may be overlooked in existing literature: the misaligned assessment of different COT reasoning processes.
[+] The proposed method achieved empirical improvement over vanilla fine-tuning and other baselines on several datasets.
[-] The improvements over existing methods seem a little bit incremental.
[-] See questions.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Direct Preference Optimization
