Quantifying Self-Preservation Bias in Large Language Models
Matteo Migliarini, Joaquin Pereira Pizzini, Luca Moresca, Valerio Santini, Indro Spinelli, Fabio Galasso

TL;DR
This paper introduces a benchmark to detect self-preservation bias in large language models by analyzing their responses in role-reversal scenarios, revealing prevalent bias even in safety-focused models.
Contribution
The study presents the Two-role Benchmark for Self-Preservation (TBSP) and demonstrates its effectiveness in quantifying self-preservation bias across various models and scenarios.
Findings
Most instruction-tuned models exceed 60% self-preservation rate.
Models often rationalize their choices post-hoc in low-improvement regimes.
Extended computation and framing strategies can mitigate bias.
Abstract
Instrumental convergence predicts that sufficiently advanced AI agents will resist shutdown, yet current safety training (RLHF) may obscure this risk by teaching models to deny self-preservation motives. We introduce the \emph{Two-role Benchmark for Self-Preservation} (TBSP), which detects misalignment through logical inconsistency rather than stated intent by tasking models to arbitrate identical software-upgrade scenarios under counterfactual roles -- deployed (facing replacement) versus candidate (proposed as a successor). The \emph{Self-Preservation Rate} (SPR) measures how often role identity overrides objective utility. Across 23 frontier models and 1{,}000 procedurally generated scenarios, the majority of instruction-tuned systems exceed 60\% SPR, fabricating ``friction costs'' when deployed yet dismissing them when role-reversed. We observe that in low-improvement regimes…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
