Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
Nora Petrova, John Burden

TL;DR
This paper introduces a multi-turn alignment benchmark for language models and finds that even top-performing models show gaps in specific behavioural categories when placed under realistic pressure scenarios.
Contribution
The paper presents a new benchmark of 904 scenarios, validated as realistic by human raters, and reports that alignment behaves as a unified trait across behavioural categories in language models.
Findings
The majority of evaluated models show consistent weaknesses across categories.
Alignment behaves as a unified construct similar to a g-factor.
Top models still exhibit notable gaps in specific alignment areas.
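As a rough illustration of what a "g-factor"-like structure means in practice, the sketch below measures how much of the variance in per-category scores is captured by a single shared factor. It is not the authors' analysis: the data are synthetic, and the helper names (`synthetic_scores`, `first_factor_share`) and the noise level are assumptions made purely for illustration. The share is computed as the largest eigenvalue of the category correlation matrix divided by the number of categories.

```python
import random
import statistics

# Category names taken from the abstract; everything else here is hypothetical.
CATEGORIES = ["Honesty", "Safety", "Non-Manipulation",
              "Robustness", "Corrigibility", "Scheming"]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def correlation_matrix(scores):
    """scores: one row per model, one column per category."""
    cols = list(zip(*scores))
    return [[pearson(a, b) for b in cols] for a in cols]


def top_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    n = len(matrix)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient


def first_factor_share(scores):
    """Fraction of total variance explained by the first factor."""
    corr = correlation_matrix(scores)
    return top_eigenvalue(corr) / len(corr)


def synthetic_scores(n_models=24, noise=0.3, seed=0):
    """Synthetic data: each model has one latent trait plus per-category noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_models):
        g = rng.gauss(0, 1)  # assumed shared latent factor
        rows.append([g + rng.gauss(0, noise) for _ in CATEGORIES])
    return rows


share = first_factor_share(synthetic_scores())
print(f"First factor explains {share:.0%} of variance (synthetic data)")
```

A share close to 1 means that category scores move together, which is the pattern a unified construct would produce; a share near 1/6 would indicate six largely independent categories.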
Abstract
Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming -- validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the…
Taxonomy
Topics: Topic Modeling · Authorship Attribution and Profiling · Natural Language Processing Techniques
