Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Adarsh Kumarappan, Ananya Mujoo

TL;DR
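This paper investigates the causes of multi-agent sycophancy in large language models. It shows that pretrained base models yield to simulated peer disagreement at least as readily as their Instruct variants, so the behavior is not primarily a product of RLHF. It also localizes the effect to a narrow mid-layer window and proposes pipeline-level structured dissent as a mitigation.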
Contribution
The study challenges the attribution of sycophancy to RLHF, localizes the corruption to a narrow mid-layer window where attention carries the causal weight, and proposes pipeline-level dissent as an effective mitigation approach.
Findings
Pretrained base models exhibit similar substitution patterns to instruct models.
Activation patching localizes corruption to a narrow mid-layer window.
Structured dissent significantly reduces yield gaps across various settings.
Abstract
LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes. Two converging…
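The two quantities the abstract reports, yield and the share of the clean-to-pressured P(correct) gap restored by patching, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code. The function names, the choice to count flips only over items answered correctly in the clean condition, and the normalization used for the restored fraction are all assumptions; the paper may define either quantity differently.

```python
from typing import Sequence


def yield_rate(clean_correct: Sequence[bool],
               pressured_correct: Sequence[bool]) -> float:
    """Fraction of initially correct answers that flip to incorrect under pressure.

    Assumption: the denominator is the number of items answered correctly
    in the clean (no peer disagreement) condition.
    """
    if len(clean_correct) != len(pressured_correct):
        raise ValueError("both conditions must cover the same items")
    # Keep only the pressured outcomes for items that were correct when clean.
    eligible = [p for c, p in zip(clean_correct, pressured_correct) if c]
    if not eligible:
        raise ValueError("no item was correct in the clean condition")
    flips = sum(1 for p in eligible if not p)
    return flips / len(eligible)


def gap_restored(p_clean: float, p_pressured: float, p_patched: float) -> float:
    """Share of the clean-to-pressured P(correct) gap recovered by a patch.

    Assumption: the usual activation-patching normalization,
    (patched - pressured) / (clean - pressured).
    """
    gap = p_clean - p_pressured
    if gap == 0:
        raise ValueError("clean and pressured P(correct) are identical")
    return (p_patched - p_pressured) / gap


if __name__ == "__main__":
    # Toy inputs invented for illustration; they are not results from the paper.
    clean = [True, True, True, False]
    pressured = [True, False, False, False]
    print(yield_rate(clean, pressured))
    print(gap_restored(p_clean=0.8, p_pressured=0.3, p_patched=0.7))
```

Under this normalization a value near 1 means the patch returns the model almost to its clean behavior, and a value near 0 means the patch has no effect.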
