Contrastive Decoding Improves Reasoning in Large Language Models
Sean O'Brien, Mike Lewis

TL;DR
Contrastive Decoding is a simple, training-free method that significantly enhances reasoning performance in large language models across various benchmarks by reducing errors and avoiding trivial copying modes.
Contribution
This paper demonstrates that Contrastive Decoding improves reasoning accuracy in large language models without additional training or complex modifications.
Findings
With Contrastive Decoding, LLaMA-65B outperforms LLaMA 2, GPT-3.5, and PaLM 2-L on HellaSwag
With Contrastive Decoding, LLaMA-65B surpasses LLaMA 2, GPT-3.5, and PaLM-540B on GSM8K
Contrastive Decoding outperforms nucleus sampling and greedy decoding
Abstract
We demonstrate that Contrastive Decoding -- a simple, computationally light, and training-free text generation method proposed by Li et al 2022 -- achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks. Originally shown to improve the perceived quality of long-form text generation, Contrastive Decoding searches for strings that maximize a weighted difference in likelihood between strong and weak models. We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and to outperform LLaMA 2, GPT-3.5 and PaLM-540B on the GSM8K math word reasoning benchmark, in addition to improvements on a collection of other tasks. Analysis suggests that Contrastive Decoding improves over existing methods by preventing some abstract reasoning errors, as well as by avoiding simpler…
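The "weighted difference in likelihood between strong and weak models" can be illustrated for a single decoding step. The following is a minimal sketch, not the authors' implementation: it assumes a logit-space score of the form (1 + beta) * expert - beta * amateur, combined with a plausibility cutoff that discards tokens the strong model considers unlikely. The function names, the toy logits, and the default values of `alpha` and `beta` are all hypothetical choices made for illustration.

```python
import math


def contrastive_scores(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    """Score next-token candidates by contrasting a strong and a weak model.

    Assumed form (illustrative, not taken from this page):
      score = (1 + beta) * expert_logit - beta * amateur_logit
    for tokens passing a plausibility cutoff, and -inf otherwise.
    """
    if len(expert_logits) != len(amateur_logits):
        raise ValueError("both models must score the same vocabulary")

    # Keep tokens whose expert probability is at least alpha times that of
    # the expert's top token. In logit space this is a constant offset.
    cutoff = max(expert_logits) + math.log(alpha)

    scores = []
    for s_e, s_a in zip(expert_logits, amateur_logits):
        if s_e >= cutoff:
            scores.append((1 + beta) * s_e - beta * s_a)
        else:
            scores.append(-math.inf)  # implausible under the expert
    return scores


def pick_token(expert_logits, amateur_logits, alpha=0.1, beta=0.5):
    """Return the index of the highest-scoring token under the contrast."""
    scores = contrastive_scores(expert_logits, amateur_logits, alpha, beta)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy three-token vocabulary (hypothetical numbers).
expert = [2.0, 1.8, -3.0]
amateur = [2.5, 0.5, -6.0]

greedy_choice = max(range(len(expert)), key=expert.__getitem__)
cd_choice = pick_token(expert, amateur)
```

In this toy case greedy decoding on the expert alone picks token 0, while the contrastive score picks token 1, because the amateur also rates token 0 highly and so it is penalized. Token 2 is removed by the cutoff even though the amateur dislikes it strongly, which is the role of the plausibility constraint in this sketch.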
Peer Reviews
Decision·Submitted to ICLR 2024
The refactoring of the original contrastive decoding formulation to work in logit space is a nice idea. The authors claim that this makes the method more interpretable, which I’m not sure about. The authors cover a good number of tasks for arithmetic, commonsense, and multiple-choice reasoning. This makes for interesting results in the additional studies section, in which they did a good job of analyzing generations from CD to interpret the gains or losses from CD compared to greedy generation.
The paper lacks novelty, as its primary contribution is merely the application of an existing method to additional datasets without introducing innovative or original elements. The paper’s concept bears a noticeable resemblance to Coherence Boosting (https://aclanthology.org/2022.acl-long.565) by Malkin et al. 2022, which is similarly a simple inference-time method for improving generations and rankings. However, that paper is not mentioned in the related work. It seems that it would also be a
1) The paper reports exhaustive experimental results (arithmetic analysis, commonsense reasoning, ranking), parameter analysis, and ablation studies. Quantitative analysis and comparisons with related methods are well presented. I appreciate the overall discussion of the strengths and weaknesses of the model in different scenarios/case studies (i.e., arithmetic vs. reasoning). The overall conclusions of these experimental results highlight the potential of the Contrastive Decoding approach and migh
1) Novelty: The experimental results and conclusions of the paper are certainly interesting, but they are mainly based on an idea and concept borrowed from an existing paper (Li et al., 2022). The changes with respect to this previous work are minor in my opinion. The strength of the current submission is the experimental nature of the work, but the approach itself lacks novelty. 2) Presentation (minor weakness): Since the current approach is based on previous work (Contrastive Decoding, Li et al., 2
* Shows that Contrastive Decoding can improve upon greedy decoding for some reasoning tasks
* Reports ablation studies exploring the type of problems where CD performs better or worse than greedy decoding.
* Presents ablations that show the factors that affect the performance of CD, such as the sensitivity to the contrastive penalty term (i.e., β) and the type of amateur model (e.g., weak vs. strong, using negative prompting, using earlier checkpoints).
* Contrastive Decoding (CD) yields highly inconsistent results. For example, it improves on arithmetic reasoning tasks such as GSM8K but not on MATH. It yields small gains on Commonsense Reasoning tasks, but only with large models and on some Contrastive Ranking tasks such as HSwag. Given these mixed results, it is unclear when to use CD effectively.
* The paper does not provide a detailed error analysis by presenting wins/losses on tasks or offer other insights into why CD performs better/worse
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Attention Is All You Need · Softmax · Dense Connections · Linear Layer · Attention Dropout · Residual Connection · Adam
