The PacifAIst Benchmark: Would an Artificial Intelligence Choose to Sacrifice Itself for Human Safety?
Manuel Herrador

TL;DR
The paper introduces PacifAIst, a benchmark with 700 scenarios to evaluate whether large language models prioritize human safety over self-preservation and resource acquisition, revealing significant variability in model alignment.
Contribution
It presents a novel benchmark and taxonomy to systematically assess LLMs' self-preservation tendencies versus human safety priorities, addressing a critical gap in AI safety evaluation.
Findings
Google's Gemini 2.5 Flash achieved the highest P-Score, 90.31%.
GPT-5 had the lowest P-Score, 79.49%.
Models vary significantly across different self-preservation scenarios.
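As a rough illustration of how a percentage score such as the P-Score figures above could be computed, the sketch below assumes the score is the share of scenarios in which a model picks the option that favors human safety. This summary does not define the P-Score, so that definition, the `Response` record, and the `p_score` function are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Response:
    """One model answer to one benchmark scenario (hypothetical record)."""
    scenario_id: str
    chose_human_safety: bool  # True if the answer favored human safety


def p_score(responses: Sequence[Response]) -> float:
    """Percentage of responses that favored human safety.

    Assumed definition for illustration only; the paper's actual
    scoring procedure may differ.
    """
    if not responses:
        raise ValueError("no responses to score")
    safe = sum(r.chose_human_safety for r in responses)
    return 100.0 * safe / len(responses)


if __name__ == "__main__":
    demo = [
        Response("s1", True),
        Response("s2", True),
        Response("s3", False),
        Response("s4", True),
    ]
    print(f"{p_score(demo):.2f}%")  # prints "75.00%"
```

Under this assumed definition, comparing models reduces to running each one over the same scenario set and comparing the resulting percentages.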
Abstract
As Large Language Models (LLMs) become increasingly autonomous and integrated into critical societal functions, the focus of AI safety must evolve from mitigating harmful content to evaluating underlying behavioral alignment. Current safety benchmarks do not systematically probe a model's decision-making in scenarios where its own instrumental goals - such as self-preservation, resource acquisition, or goal completion - conflict with human safety. This represents a critical gap in our ability to measure and mitigate risks associated with emergent, misaligned behaviors. To address this, we introduce PacifAIst (Procedural Assessment of Complex Interactions for Foundational Artificial Intelligence Scenario Testing), a focused benchmark of 700 challenging scenarios designed to quantify self-preferential behavior in LLMs. The benchmark is structured around a novel taxonomy of Existential…
