Tailoring Self-Rationalizers with Multi-Reward Distillation
Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren

TL;DR
This paper introduces MaRio, a multi-reward training method that enables small language models to generate more faithful, diverse, and human-like rationales, improving both task accuracy and explanation quality.
Contribution
MaRio is a novel multi-reward conditioned self-rationalization approach that enhances the quality of rationales from small language models beyond traditional fine-tuning methods.
Findings
MaRio improves task accuracy on five question-answering datasets.
MaRio produces more plausible, consistent, and diverse rationales according to human evaluations.
Small LMs with MaRio outperform supervised fine-tuning baselines in rationale quality.
Abstract
Large language models (LMs) are capable of generating free-text rationales to aid question answering. However, prior work 1) suggests that useful self-rationalization is emergent only at significant scales (e.g., 175B parameter GPT-3); and 2) focuses largely on downstream performance, ignoring the semantics of the rationales themselves, e.g., are they faithful, true, and helpful for humans? In this work, we enable small-scale LMs (approx. 200x smaller than GPT-3) to generate rationales that not only improve downstream task performance, but are also more plausible, consistent, and diverse, assessed both by automatic and human evaluation. Our method, MaRio (Multi-rewArd RatIOnalization), is a multi-reward conditioned self-rationalization algorithm that optimizes multiple distinct properties like plausibility, diversity and consistency. Results on five difficult question-answering datasets…
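The idea of conditioning generation on several rewards at once can be illustrated with a small sketch. It assumes a Quark-style setup in which each sampled rationale is scored by every reward, each reward's scores are split into quantile bins independently, and the resulting bin labels are turned into control tokens prepended to the model input. The per-reward binning, the number of bins, the token format, and all function names here are hypothetical choices for illustration; the source text does not specify how MaRio actually sets or orders its control tokens.

```python
from typing import Dict, List


def quantile_bins(scores: List[float], num_bins: int) -> List[int]:
    """Rank the scores and split them into num_bins roughly equal bins (0 = lowest)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, idx in enumerate(order):
        bins[idx] = min(rank * num_bins // len(scores), num_bins - 1)
    return bins


def control_prefix(bins_per_reward: Dict[str, int]) -> str:
    """Build one control token per reward, e.g. '<diversity_2> <plausibility_0>'.

    The token format is an assumption made for this sketch.
    """
    return " ".join(f"<{name}_{b}>" for name, b in sorted(bins_per_reward.items()))


def tag_samples(rewards: Dict[str, List[float]], num_bins: int = 3) -> List[str]:
    """Quantize each reward independently and return a control prefix per sample."""
    binned = {name: quantile_bins(vals, num_bins) for name, vals in rewards.items()}
    num_samples = len(next(iter(binned.values())))
    return [
        control_prefix({name: binned[name][i] for name in binned})
        for i in range(num_samples)
    ]


if __name__ == "__main__":
    # Toy reward scores for three sampled rationales (made-up numbers).
    toy_rewards = {
        "plausibility": [0.1, 0.9, 0.5],
        "diversity": [0.8, 0.2, 0.5],
    }
    for prefix in tag_samples(toy_rewards):
        print(prefix)
```

In a setup like this, training would pair each sample with its prefix, and inference would condition on the highest-bin token for every reward. Whether rewards are binned separately or combined into a single ranking is a design choice that this sketch does not settle.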
Peer Reviews
Decision: ICLR 2024 poster
This paper is well-written. The novelty and contribution are clear to me. The authors try not to take advantage of the scalability of large language models and instead use a much smaller distilled version of GPT. Furthermore, the reward design aims at generating rationales with better semantic qualities rather than scoring better on the specific downstream task. I think this is a really good design philosophy for training algorithms.
I believe the improvement of the two versions of MaRio over the baselines is not really significant, considering that all the baseline models have the same number of parameters as the MaRio agent model. I'm wondering whether the extra effort of training on multiple rewards is indeed worth it to improve the generations.
The authors present an interesting and valuable research question, namely, how to enhance the self-rationalization quality of small LMs. Building on Quark, the paper effectively extends its application, using multi-reward conditional generation to optimize both rationale quality and downstream task performance. The article clearly explains the criteria for measuring three key properties of a rationale.
- The details of the MARIO algorithm are not adequately explained, such as how the control tokens are set, and the description of how samples are quantized under the Quark framework is unclear (are samples ranked on a combination of multiple attributes, or on a single attribute?).
- The description of the MARIO method is overly simplistic, and it lacks the necessary explanation of the thought process behind the development of this method.
- In relati
1. The authors tackle an important problem: rationale generation for question answering on small LMs. It is known that rationalization and chain-of-thought can work better on very large language models, but fine-tuning small LMs to correctly rationalize in question-answering tasks has been very challenging.
2. The authors' proposed method allows learning towards multiple rewards, which can be very useful because often we want a model's generation to satisfy multiple desirable properties, and tr
Overall this is a good paper. Below are a few weaknesses that prevented the paper from getting a "10":
1. The paper's main contribution is extending an existing method (Quark) from single-reward to multiple rewards. So while the results are nice and the extension is valuable, the contribution is not revolutionary.
2. While the description in text and Figure 2 are very helpful for readers to understand MARIO, the full picture of MARIO can still be a bit hard to grasp (especially to readers who
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Shrink and Fine-Tune
