Tailoring Self-Rationalizers with Multi-Reward Distillation

Sahana Ramnath; Brihi Joshi; Skyler Hallinan; Ximing Lu; Liunian Harold Li; Aaron Chan; Jack Hessel; Yejin Choi; Xiang Ren

arXiv:2311.02805 · cs.CL · May 24, 2024 · 1 cites

TL;DR

This paper introduces MaRio, a multi-reward training method that enables small language models to generate more faithful, diverse, and human-like rationales, improving both task accuracy and explanation quality.

Contribution

MaRio is a novel multi-reward conditioned self-rationalization approach that enhances the quality of rationales from small language models beyond traditional fine-tuning methods.

Findings

01. MaRio improves task accuracy on five question-answering datasets.

02. MaRio produces more plausible, consistent, and diverse rationales according to human evaluations.

03. Small LMs with MaRio outperform supervised fine-tuning baselines in rationale quality.

Abstract

Large language models (LMs) are capable of generating free-text rationales to aid question answering. However, prior work 1) suggests that useful self-rationalization is emergent only at significant scales (e.g., 175B parameter GPT-3); and 2) focuses largely on downstream performance, ignoring the semantics of the rationales themselves, e.g., are they faithful, true, and helpful for humans? In this work, we enable small-scale LMs (approx. 200x smaller than GPT-3) to generate rationales that not only improve downstream task performance, but are also more plausible, consistent, and diverse, assessed both by automatic and human evaluation. Our method, MaRio (Multi-rewArd RatIOnalization), is a multi-reward conditioned self-rationalization algorithm that optimizes multiple distinct properties like plausibility, diversity and consistency. Results on five difficult question-answering datasets…
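The reviews below note that MaRio builds on Quark and relies on control tokens and quantized samples. As a rough illustration of what reward-conditioned inputs can look like, the following minimal sketch bins each reward score into quantiles and prepends one control token per property. The token format, the number of bins, the helper names, and the choice to quantize each reward separately are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from bisect import bisect_right
from typing import Dict, List

# The three properties named in the abstract; everything else here is hypothetical.
REWARD_NAMES = ["plausibility", "diversity", "consistency"]


def quantile_edges(scores: List[float], num_bins: int) -> List[float]:
    """Return num_bins - 1 cut points that split the scores into roughly equal bins."""
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]


def control_tokens(sample_rewards: Dict[str, float],
                   edges: Dict[str, List[float]]) -> str:
    """Map each reward score to its bin and render one control token per reward.

    Assumption: each reward is quantized on its own, independently of the others.
    """
    tokens = []
    for name in REWARD_NAMES:
        bin_index = bisect_right(edges[name], sample_rewards[name])
        tokens.append(f"<{name}_{bin_index}>")
    return " ".join(tokens)


def build_training_input(question: str,
                         sample_rewards: Dict[str, float],
                         edges: Dict[str, List[float]]) -> str:
    """Prepend the control tokens of a scored sample to its question."""
    return f"{control_tokens(sample_rewards, edges)} {question}"


def best_control_prefix(num_bins: int) -> str:
    """Prefix requesting the highest bin of every reward, as one might use at inference."""
    return " ".join(f"<{name}_{num_bins - 1}>" for name in REWARD_NAMES)


if __name__ == "__main__":
    pool = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # toy reward scores, not real data
    edges = {name: quantile_edges(pool, num_bins=3) for name in REWARD_NAMES}
    rewards = {"plausibility": 0.6, "diversity": 0.1, "consistency": 0.4}
    print(build_training_input("Why is the sky blue?", rewards, edges))
    print(best_control_prefix(3))
```

In a setup like this, the model would be trained on rationales tagged with the bins they actually achieved, and prompted at inference time with the top-bin prefix for every property. Whether bins are assigned per reward or from a combined ranking is exactly the kind of detail Reviewer 02 asks about, so the per-reward choice here should be read as one possible design, not as the authors' method.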

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

This paper is well-written. The novelty and contribution are clear to me. The authors do not rely on the scale of large language models and instead use a much smaller distilled version of GPT. Furthermore, the reward design aims at generating rationales with better semantic qualities rather than scoring better on the specific downstream task. I think this is a really good design philosophy for training algorithms.

Weaknesses

I believe the improvement of the two MaRio variants over the baselines is not really significant, considering that all the baseline models have the same number of parameters as the MaRio agent model. I wonder whether the extra effort of training on multiple rewards is indeed worth the improvement in the generations.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The authors present an interesting and valuable research question, namely, how to enhance the self-rationalization quality of small LMs. Building on Quark, the paper effectively extends its application, using multi-reward conditional generation to optimize both rationale quality and downstream task performance. The paper clearly explains the criteria for measuring three key properties of a rationale.

Weaknesses

- The details of the MaRio algorithm are not adequately explained, such as how the control-token settings are determined. The description of how samples are quantized under the Quark framework is also unclear (are samples ranked by jointly considering multiple attributes, or by a single attribute?).
- The description of the MaRio method is overly simplistic and lacks the necessary explanation of the reasoning behind its development.
- In relati

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

1. The authors tackle an important problem: rationale generation for question answering on small LMs. It is known that rationalization and chain-of-thought can work better on very large language models, but fine-tuning small LMs to correctly rationalize in question-answering tasks has been very challenging.
2. The authors' proposed method allows learning towards multiple rewards, which can be very useful because often we want a model's generation to satisfy multiple desirable properties, and tr

Weaknesses

Overall this is a good paper. Below are a few weaknesses that prevented the paper from getting a "10":
1. The paper's main contribution is extending an existing method (Quark) from single-reward to multiple rewards. So while the results are nice and the extension is valuable, the contribution is not revolutionary.
2. While the description in text and Figure 2 are very helpful for readers to understand MaRio, the full picture of MaRio can still be a bit hard to grasp (especially to readers who

Code & Models

Repositories

ink-usc/rationalemultirewarddistillation
PyTorch · Official

Videos

Tailoring Self-Rationalizers with Multi-Reward Distillation· slideslive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Shrink and Fine-Tune