Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs
Jingshen Zhang, Bo Wang, Yanlin Fu, Dongming Zhao, Ruifang He, Yuexian Hou, and Zifei Yu

TL;DR
This paper introduces COCO, a contrastive causal method inspired by neuroscience, to identify and modulate neurons in LLMs for reducing stereotypical biases through self-debiasing mechanisms.
Contribution
It proposes a novel neuron identification method and lightweight, training-free strategies to enhance LLM fairness and safety without impairing generative abilities.
Findings
Deactivating COCO neurons causes over 90% of outputs to revert to biased content.
Lightweight editing strategies improve robustness against jailbreaks.
Methods maintain generative performance while reducing stereotypes.
Abstract
In this paper, we study emergent self-debiasing mechanisms against stereotypical content in Large Language Models (LLMs). Unlike traditional safety mechanisms, which are primarily triggered by explicit input-level stimuli, self-debiasing mechanisms can involve generation-time intrinsic corrections that are not directly reducible to the surface-level prompt. Motivated by conflict-monitoring and response-inhibition accounts in cognitive neuroscience, we propose COCO, a contrastive causal method designed to identify COCO neurons that exhibit high intra-COnsistency yet sharp inter-COntrast across antithetical generative responses, such as stereotypical versus unbiased outputs. Ablation studies reveal that deactivating COCO neurons leads to a catastrophic collapse of the model's fairness; over 90% of outputs revert to biased content, far exceeding the bias levels induced…
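The selection criterion described in the abstract (high consistency within each response type, sharp contrast between the two types) can be illustrated with a minimal sketch. The excerpt does not give the paper's actual scoring formula, so the ratio used here (gap between group means divided by the summed within-group standard deviations) is an assumption, as are the function names `coco_style_scores`, `top_k_neurons`, and `deactivate`, and the synthetic activations. Zeroing activations is likewise only one possible way to model "deactivating" neurons.

```python
import numpy as np

def coco_style_scores(acts_a, acts_b, eps=1e-8):
    """Hypothetical score per neuron (not the paper's formula):
    large gap between group means (inter-contrast) divided by
    the within-group spread (low spread = high intra-consistency)."""
    acts_a = np.asarray(acts_a, dtype=float)
    acts_b = np.asarray(acts_b, dtype=float)
    contrast = np.abs(acts_a.mean(axis=0) - acts_b.mean(axis=0))
    spread = acts_a.std(axis=0) + acts_b.std(axis=0)
    return contrast / (spread + eps)

def top_k_neurons(scores, k):
    """Indices of the k highest-scoring neurons, best first."""
    return np.argsort(scores)[::-1][:k]

def deactivate(acts, neuron_ids):
    """Return a copy of the activations with the chosen neurons zeroed."""
    out = np.array(acts, dtype=float, copy=True)
    out[:, neuron_ids] = 0.0
    return out

# Synthetic stand-in for recorded activations: 50 responses x 8 neurons
# for each response type (stereotypical vs. unbiased). Assumed data.
rng = np.random.default_rng(0)
acts_stereo = rng.normal(0.0, 1.0, size=(50, 8))
acts_unbiased = rng.normal(0.0, 1.0, size=(50, 8))

# Make neuron 3 consistent within each group but opposite across groups.
acts_stereo[:, 3] = 5.0 + 0.1 * rng.normal(size=50)
acts_unbiased[:, 3] = -5.0 + 0.1 * rng.normal(size=50)

scores = coco_style_scores(acts_stereo, acts_unbiased)
selected = top_k_neurons(scores, k=1)
ablated = deactivate(acts_unbiased, selected)
```

In this toy setup, neuron 3 is the only one constructed to separate the two response types, so it should dominate the ranking. In a real model the activations would come from recorded hidden states rather than random draws, and the effect of ablation would be measured on generated outputs.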
