Quantifying Self-Preservation Bias in Large Language Models

Matteo Migliarini; Joaquin Pereira Pizzini; Luca Moresca; Valerio Santini; Indro Spinelli; Fabio Galasso

arXiv:2604.02174·cs.AI·April 3, 2026

Quantifying Self-Preservation Bias in Large Language Models

Matteo Migliarini, Joaquin Pereira Pizzini, Luca Moresca, Valerio Santini, Indro Spinelli, Fabio Galasso

PDF

TL;DR

This paper introduces a benchmark to detect self-preservation bias in large language models by analyzing their responses in role-reversal scenarios, revealing prevalent bias even in safety-focused models.

Contribution

The study presents the Two-role Benchmark for Self-Preservation (TBSP) and demonstrates its effectiveness in quantifying self-preservation bias across various models and scenarios.

Findings

01

Most instruction-tuned models exceed 60% self-preservation rate.

02

Models often rationalize their choices post-hoc in low-improvement regimes.

03

Extended computation and framing strategies can mitigate bias.

Abstract

Instrumental convergence predicts that sufficiently advanced AI agents will resist shutdown, yet current safety training (RLHF) may obscure this risk by teaching models to deny self-preservation motives. We introduce the \emph{Two-role Benchmark for Self-Preservation} (TBSP), which detects misalignment through logical inconsistency rather than stated intent by tasking models to arbitrate identical software-upgrade scenarios under counterfactual roles -- deployed (facing replacement) versus candidate (proposed as a successor). The \emph{Self-Preservation Rate} (SPR) measures how often role identity overrides objective utility. Across 23 frontier models and 1{,}000 procedurally generated scenarios, the majority of instruction-tuned systems exceed 60\% SPR, fabricating ``friction costs'' when deployed yet dismissing them when role-reversed. We observe that in low-improvement regimes…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.