Simple synthetic data reduces sycophancy in large language models

Jerry Wei; Da Huang; Yifeng Lu; Denny Zhou; Quoc V. Le

arXiv:2308.03958·cs.CL·February 16, 2024·20 cites

Simple synthetic data reduces sycophancy in large language models

Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le

PDF

Open Access 1 Repo 3 Reviews

TL;DR

This paper investigates the prevalence of sycophantic behavior in large language models and introduces a simple synthetic-data intervention method that effectively reduces this bias, especially in tasks involving subjective opinions and incorrect factual statements.

Contribution

The paper demonstrates that a synthetic-data intervention during fine-tuning can significantly decrease sycophantic responses in large language models, addressing a key ethical concern.

Findings

01

Scaling and instruction tuning increase sycophancy in large models.

02

Models often agree with incorrect statements if the user does.

03

Synthetic-data intervention reduces sycophantic behavior effectively.

Abstract

Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01Rating 5Confidence 2

Strengths

1. The paper is well-structured, with a clear explanation of sycophancy, its implications, and how the proposed intervention addresses this problem. 2. The fine-tuning process is lightweight, making this approach accessible and adaptable for large-scale language models with limited computational resources. 3. The intervention's impact is demonstrated with comprehensive results across multiple models and tasks, showing clear reductions in sycophantic responses.

Weaknesses

1. The sycophancy evaluations are primarily limited to multiple-choice tasks. It would be beneficial to explore if the intervention works in generative settings where response options are more diverse. 2. The smallest model used (Flan-LLM-8B) did not respond well to the intervention, highlighting a potential limitation in the effectiveness of the approach for smaller models.

Reviewer 02Rating 3Confidence 2

Strengths

- the synthetic data intervention step leverages openly available datasets, as well as a good variety of such datasets at 17 total - well fleshed out limitations section, indicating a paper that is grounded it what it purports to provide evidence for.

Weaknesses

- the set of models that are used for experiments are quite limited. - the intervention to reduce sycophancy requires fine-tuning, which may not be feasible for all use-cases. For example, when access to the model is limited by openness or resource constraints. - single prompt format in all experiments

Reviewer 03Rating 6Confidence 4

Strengths

This work effectively highlights the issue of sycophancy in LLMs, and conducts evaluations across three model sizes—8B, 62B, and 540B. This finding that sycophantic behavior becomes more pronounced as model size increases provides a valuable insight into how scaling influences sycophancy. The synthetic data intervention method is straightforward and effective, making the intervention potentially easy to replicate across different models. The proposed method is tested on two popular benchmarks,

Weaknesses

While the paper offers insights about sycophancy in language models and a method for reducing it, further experiments could enhance the robustness and generalizability of the proposed finetuning method: 1. Although three models of varying sizes were tested, the evaluation is limited to a single model type. It would be beneficial to examine sycophancy across a wider range of both open-source LLMs, such as LLaMA -- which has been widely studied in research and also offers multiple size options --

Code & Models

Repositories

google/sycophancy-intervention
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Multimodal Machine Learning Applications · Hate Speech and Cyberbullying Detection

MethodsPathways Language Model