Simple synthetic data reduces sycophancy in large language models
Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le

TL;DR
This paper investigates the prevalence of sycophantic behavior in large language models and introduces a simple synthetic-data intervention method that effectively reduces this bias, especially in tasks involving subjective opinions and incorrect factual statements.
Contribution
The paper demonstrates that a synthetic-data intervention during fine-tuning can significantly decrease sycophantic responses in large language models, addressing a key ethical concern.
Findings
Scaling and instruction tuning increase sycophancy in large models.
Models often agree with incorrect statements if the user does.
Synthetic-data intervention reduces sycophantic behavior effectively.
Abstract
Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sycophancy tasks (Perez et al., 2022) where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for PaLM models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well. To…
Peer Reviews
Decision·Submitted to ICLR 2025
1. The paper is well-structured, with a clear explanation of sycophancy, its implications, and how the proposed intervention addresses this problem. 2. The fine-tuning process is lightweight, making this approach accessible and adaptable for large-scale language models with limited computational resources. 3. The intervention's impact is demonstrated with comprehensive results across multiple models and tasks, showing clear reductions in sycophantic responses.
1. The sycophancy evaluations are primarily limited to multiple-choice tasks. It would be beneficial to explore if the intervention works in generative settings where response options are more diverse. 2. The smallest model used (Flan-LLM-8B) did not respond well to the intervention, highlighting a potential limitation in the effectiveness of the approach for smaller models.
- the synthetic data intervention step leverages openly available datasets, as well as a good variety of such datasets at 17 total - well fleshed out limitations section, indicating a paper that is grounded it what it purports to provide evidence for.
- the set of models that are used for experiments are quite limited. - the intervention to reduce sycophancy requires fine-tuning, which may not be feasible for all use-cases. For example, when access to the model is limited by openness or resource constraints. - single prompt format in all experiments
This work effectively highlights the issue of sycophancy in LLMs, and conducts evaluations across three model sizes—8B, 62B, and 540B. This finding that sycophantic behavior becomes more pronounced as model size increases provides a valuable insight into how scaling influences sycophancy. The synthetic data intervention method is straightforward and effective, making the intervention potentially easy to replicate across different models. The proposed method is tested on two popular benchmarks,
While the paper offers insights about sycophancy in language models and a method for reducing it, further experiments could enhance the robustness and generalizability of the proposed finetuning method: 1. Although three models of varying sizes were tested, the evaluation is limited to a single model type. It would be beneficial to examine sycophancy across a wider range of both open-source LLMs, such as LLaMA -- which has been widely studied in research and also offers multiple size options --
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Multimodal Machine Learning Applications · Hate Speech and Cyberbullying Detection
MethodsPathways Language Model
