Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs

Jingshen Zhang; Bo Wang; Yanlin Fu; Dongming Zhao; Ruifang He; Yuexian Hou; and Zifei Yu

arXiv:2605.09647·cs.SI·May 12, 2026

TL;DR

This paper introduces COCO, a contrastive causal method inspired by neuroscience, to identify and modulate neurons in LLMs for reducing stereotypical biases through self-debiasing mechanisms.

Contribution

It proposes a novel neuron identification method and lightweight, training-free strategies to enhance LLM fairness and safety without impairing generative abilities.

Findings

01

Deactivating COCO neurons causes over 90% of outputs to revert to biased content.

02

Lightweight editing strategies improve robustness against jailbreaks.

03

Methods maintain generative performance while reducing stereotypes.

Abstract

In this paper, we study emergent self-debiasing mechanisms against stereotypical content in Large Language Models (LLMs). Unlike traditional safety mechanisms, which are primarily triggered by explicit input-level stimuli, self-debiasing mechanisms can involve intrinsic generation-time correction that is not directly reducible to the surface-level prompt. Motivated by conflict-monitoring and response-inhibition accounts in cognitive neuroscience, we propose COCO, a contrastive causal method designed to identify COCO neurons that exhibit high intra-COnsistency yet sharp inter-COntrast across antithetical generative responses, such as stereotypical versus unbiased outputs. Ablation studies reveal that deactivating COCO neurons leads to a catastrophic collapse of the model's fairness: over 90% of outputs revert to biased content, far exceeding the bias levels induced…
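The abstract gives the selection criterion only in words (high consistency within each response type, sharp contrast between the two types), so the following is a minimal illustrative sketch and not the paper's actual scoring rule. It assumes per-neuron activations have already been collected for two groups of responses. The function names `coco_style_scores`, `top_k_neurons`, and `deactivate`, the ratio-style score, and zeroing as the form of deactivation are all hypothetical choices made for illustration.

```python
from statistics import mean, pstdev


def coco_style_scores(acts_a, acts_b, eps=1e-8):
    """Score neurons by between-group contrast over within-group spread.

    acts_a, acts_b: lists of activation vectors, one vector per response
    (e.g. stereotypical vs. unbiased generations).
    ASSUMPTION: this ratio is a stand-in; the paper's formula is not
    given in the abstract.
    """
    n_neurons = len(acts_a[0])
    scores = []
    for j in range(n_neurons):
        a = [row[j] for row in acts_a]
        b = [row[j] for row in acts_b]
        contrast = abs(mean(a) - mean(b))  # inter-group contrast
        spread = pstdev(a) + pstdev(b)     # small spread = high intra-group consistency
        scores.append(contrast / (spread + eps))
    return scores


def top_k_neurons(scores, k):
    """Return indices of the k highest-scoring neurons."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def deactivate(activation, neuron_ids):
    """Zero the selected neurons (ASSUMPTION: zeroing as a simple ablation)."""
    ids = set(neuron_ids)
    return [0.0 if j in ids else v for j, v in enumerate(activation)]


# Toy data: 3 responses per group, 3 neurons each (made-up numbers).
biased = [[0.9, 0.1, 0.5], [1.1, 0.9, 0.5], [1.0, 0.5, 0.4]]
unbiased = [[-1.0, 0.2, 0.5], [-0.9, 0.8, 0.6], [-1.1, 0.5, 0.5]]

scores = coco_style_scores(biased, unbiased)
selected = top_k_neurons(scores, k=1)
ablated = deactivate([0.7, 0.3, 0.2], selected)
```

In this toy data only neuron 0 is both stable within each group and clearly separated between groups, so it receives the highest score. Neuron 1 varies widely inside each group, and neuron 2 barely differs between groups.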
