KNIFE: Distilling Reasoning Knowledge From Free-Text Rationales
Aaron Chan, Zhiyuan Zeng, Wyatt Lake, Brihi Joshi, Hanjie Chen, Xiang Ren

TL;DR
KNIFE is a novel method that distills reasoning knowledge from free-text rationales into smaller language models, significantly improving their reasoning performance without requiring large models or direct rationale input.
Contribution
The paper introduces KNIFE, a technique that distills reasoning knowledge from free-text rationales (FTRs), via a teacher LM finetuned on task inputs paired with FTRs, into a small (<1B) LM, improving that LM's task performance.
Findings
KNIFE outperforms baselines on question-answering tasks.
The quality of the free-text rationales (FTRs) significantly impacts KNIFE's effectiveness.
Small (<1B) LMs can acquire reasoning knowledge from FTRs using KNIFE.
Abstract
Language models (LMs) have yielded impressive results on many language reasoning tasks, but their unexpected errors raise doubts about their reasoning abilities. In light of this, there is growing interest in finetuning/prompting LMs with both task instances and their associated free-text rationales (FTRs), which explain the correct reasoning process for predicting the correct task output (i.e., how to be "right for the right reasons"). However, existing finetuning methods fail to improve LM performance, while prompting needs prohibitively large (i.e., >50B) LMs to work well. We propose KNIFE, which shows that reasoning knowledge can be effectively distilled from FTRs into a small (i.e., <1B) LM and improve the LM's performance. First, KNIFE finetunes a teacher LM (given task input and FTR) to predict the task output, transferring reasoning knowledge from the FTRs to the teacher's…
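The teacher setup described in the abstract can be illustrated with a minimal sketch. The abstract is cut off before it describes how the knowledge reaches the small LM, so the second half of this example assumes a generic hidden-state matching objective (a common form of knowledge distillation) and may differ from the authors' actual loss. All function names, the `rationale:` separator, and the `weight` parameter are hypothetical.

```python
from typing import List


def build_teacher_input(task_input: str, rationale: str) -> str:
    """Teacher sees the task input together with its free-text rationale (FTR).

    The "rationale:" separator is an illustrative assumption.
    """
    return f"{task_input} rationale: {rationale}"


def build_student_input(task_input: str) -> str:
    """Small LM sees the task input only, so no FTR is needed at inference."""
    return task_input


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student_hidden: List[float],
    teacher_hidden: List[float],
    task_loss: float,
    weight: float = 1.0,
) -> float:
    """Assumed objective: task loss plus a weighted hidden-state matching term."""
    return task_loss + weight * mse(student_hidden, teacher_hidden)
```

In this sketch the rationale only ever enters through `build_teacher_input`; the small LM is pulled toward the teacher's representations while reading the bare task input.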
Peer Reviews
Decision · Submitted to ICLR 2024
- **New Approach to a Specific Problem;** The proposed method represents a novel approach to a specialized problem where either human-written or machine-generated free-text rationales are available, and the language model architecture is based on an encoder-decoder system like T5. As far as I am aware, this particular issue hasn’t been addressed in previous works.
- **Thorough Analysis;** The authors conducted comprehensive experiments under various conditions, including two different sizes of
- **Limited Contribution;** While I acknowledge the novelty and design of the proposed method aimed at distilling reasoning knowledge from free-text rationales in encoder-decoder LMs, its contribution appears limited for several reasons:
  - The efficacy of the proposed method, KNIFE, seems marginal as the improvements are not statistically significant based on some results in Table 1. For instance, in the StrategyQA dataset, KNIFE’s performance is comparable to FT (I→RO), suggesting that simp
- The scenario is likely: there are rationale annotations for training data, but these annotations are not readily available at test time.
- The method is intuitive and effective. The bottleneck architecture has novelty.
- Compared with multiple baselines and performed careful ablation study.
- The paper is about classification and shows advantages on two datasets. Results on more tasks will be helpful for showing the generality of the method. Does the method work for reasoning tasks commonly used by the chain-of-thought literature, such as arithmetic reasoning, commonsense reasoning, and code generation? Does it work for knowledge-intensive tasks?
- The paper doesn't have a retrieval-augmentation baseline. Will the numbers look better if you finetune T5 to learn to condition on ret
They study the idea of knowledge distillation from free-text rationales and did comprehensive experiments to show the effectiveness of the approach.
* The idea behind this approach isn't very convincing. The teacher model can't store a lot of knowledge, and it might not work well for different tasks. Plus, it's unclear how this method is better than retrieval-augmented generation.
* To prove its effectiveness, more experiments should be done, comparing it to retrieval-augmented generation and testing it on various downstream tasks.
* The improvement from introducing free-text rationale into the teacher model isn't substantial, and it might b
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: fail · ALIGN
