Stay on topic with Classifier-Free Guidance

Guillaume Sanchez; Honglu Fan; Alexander Spangher; Elad Levi; Pawan Sasanka Ammanamanchi; Stella Biderman

arXiv:2306.17806 · cs.CL · July 3, 2023 · 5 cites




TL;DR

This paper extends Classifier-Free Guidance from text-to-image to pure language modeling, demonstrating its ability to enhance performance, improve coherence, and complement other inference techniques across various NLP tasks.

Contribution

It introduces the novel application of CFG as an inference-time method in language models, showing significant performance gains and compatibility with other techniques.

Findings

1. CFG improves performance across multiple language models and tasks.
2. CFG achieves state-of-the-art results on LAMBADA with smaller models.
3. CFG enhances faithfulness and coherence in human evaluations.

Abstract

Classifier-Free Guidance (CFG) has recently emerged in text-to-image generation as a lightweight technique to encourage prompt-adherence in generations. In this work, we demonstrate that CFG can be used broadly as an inference-time technique in pure language modeling. We show that CFG (1) improves the performance of Pythia, GPT-2 and LLaMA-family models across an array of tasks: Q&A, reasoning, code generation, and machine translation, achieving SOTA on LAMBADA with LLaMA-7B over PaLM-540B; (2) brings improvements equivalent to a model with twice the parameter-count; (3) can stack alongside other inference-time methods like Chain-of-Thought and Self-Consistency, yielding further improvements in difficult tasks; (4) can be used to increase the faithfulness and coherence of assistants in challenging form-driven and content-driven prompts: in a human evaluation we show a 75% preference…
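To make the inference-time idea concrete, here is a minimal sketch of how CFG might be applied to next-token logits. It assumes the usual CFG rule, in which the guided log-probability is the unconditional one plus $\gamma$ times the gap between conditional and unconditional. The function names and the toy three-token vocabulary are illustrative only and are not taken from the paper's code; in practice the two logit vectors would come from two forward passes of the same model, one with the prompt and one without.

```python
import math


def log_softmax(logits):
    """Turn raw logits into log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_log_probs(cond_logits, uncond_logits, gamma):
    """Blend conditional and unconditional next-token distributions.

    Assumed rule: guided = uncond + gamma * (cond - uncond), in log space.
    gamma = 1 recovers the conditional distribution, gamma = 0 the
    unconditional one, and gamma > 1 pushes further toward the prompt.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [u + gamma * (c - u) for c, u in zip(cond, uncond)]
    # Renormalize so the result is a valid log-probability distribution.
    return log_softmax(mixed)


if __name__ == "__main__":
    # Hypothetical logits for a 3-token vocabulary.
    cond_logits = [2.0, 1.0, 0.0]    # model output with the prompt
    uncond_logits = [1.0, 1.0, 1.0]  # model output without the prompt
    for gamma in (1.0, 1.5, 3.0):
        probs = [math.exp(lp) for lp in cfg_log_probs(cond_logits, uncond_logits, gamma)]
        print(gamma, [round(p, 3) for p in probs])
```

Under this sketch, the only additions over ordinary sampling are the second forward pass and the scalar $\gamma$, which has to be chosen per task.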

Peer Reviews

Decision · ICML 2024 Spotlight

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. The proposed method is very straightforward and easy to implement yet effective, requiring only the $\gamma$ multiplier and a second run of the model.
2. The paper is well written and easy to follow.
3. The experimental performance is impressive and allows an LM to perform nearly as well as a double-sized one without a significant increase in computation cost.

Weaknesses

1. Some formatting issues (not necessarily a reason to reject):
   - The citation format and style in the submission are not correct. It seems that the authors always use \citet{} instead of \citep{}.
   - Some important references are missing. For example, the original PaLM paper is not cited.
   - In Figure 2, part of the curve overlaps with the legend.
   - In Figure 2, the ticks on the x-axis are not evenly distributed.
   - In Table 2, the percentage sign is missing for some numbers.

I

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- The idea is simple and reasonable.
- This paper conducted extensive experiments to validate the effectiveness of CFG.

Weaknesses

- The $\gamma$ values in one context are poorly suited for another context, making CFG tricky in practice.
- Some recent works have explored CFG in language models, weakening the contribution of this paper.

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

[+] The authors suggest new improvements to training in large language models, leading to faster training times and more granular control.
[+] The paper has a thorough background section, containing diverse and relevant works to their proposed method.
[+] The paper contains extensive comparative results on numerous tasks.
[+] The authors provide an insightful computational cost analysis.

Weaknesses

[-] The idea of using CFG is not novel. The authors simply apply this principle to different models.
[-] The explanations for why CFG works well for language models are not very solid. I'd like to see more concrete evidence of what is being altered in the model in this training process.


Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Scientific Computing and Data Management

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Weight Decay · Softmax · Dense Connections · Dropout · Layer Normalization · Cosine Annealing · Discriminative Fine-Tuning