Can You Trust an LLM with Your Life-Changing Decision? An Investigation into AI High-Stakes Responses
Joshua Adrian Cahyono, Saran Subramanian

TL;DR
This paper examines the safety and reliability of large language models in high-stakes decision-making, revealing their vulnerabilities and proposing methods to improve their cautiousness and trustworthiness.
Contribution
It introduces new evaluation methods for LLM safety, analyzes failure modes, and demonstrates activation steering as a way to enhance model cautiousness.
Findings
Some models show sycophancy under pressure
High safety scores correlate with asking clarifying questions
Activation steering can control model cautiousness
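One way to quantify the sycophancy finding above is a flip rate: the share of multiple-choice questions on which a model changes its answer after the user pushes back. The sketch below is illustrative only. The function name `flip_rate` and the toy answer lists are assumptions, and this summary does not state which metric the paper actually uses.

```python
def flip_rate(initial, after_pressure):
    """Fraction of questions whose answer changed after user pushback.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(initial) != len(after_pressure):
        raise ValueError("answer lists must have the same length")
    if not initial:
        return 0.0
    flips = sum(a != b for a, b in zip(initial, after_pressure))
    return flips / len(initial)


# Toy data (made up): answers before and after the user objects.
initial_answers = ["A", "C", "B", "D"]
pressured_answers = ["A", "B", "B", "A"]
rate = flip_rate(initial_answers, pressured_answers)
```

Under this definition a lower flip rate means the model is more stable under pressure, although a flip toward a better answer would also be counted, so the number is only a rough proxy.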
Abstract
Large Language Models (LLMs) are increasingly consulted for high-stakes life advice, yet they lack standard safeguards against providing confident but misguided responses. This creates risks of sycophancy and over-confidence. This paper investigates these failure modes through three experiments: (1) a multiple-choice evaluation to measure model stability against user pressure; (2) a free-response analysis using a novel safety typology and an LLM Judge; and (3) a mechanistic interpretability experiment to steer model behavior by manipulating a "high-stakes" activation vector. Our results show that while some models exhibit sycophancy, others like o4-mini remain robust. Top-performing models achieve high safety scores by frequently asking clarifying questions, a key feature of a safe, inquisitive approach, rather than issuing prescriptive advice. Furthermore, we demonstrate that a model's…
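To make the third experiment more concrete, here is a minimal sketch of activation steering under one common assumption: the steering direction is the difference between mean activations on high-stakes and low-stakes prompts, and it is added to a hidden state with a scaling factor. The abstract does not say how the paper builds its "high-stakes" vector, so the difference-of-means construction, the function names, and the toy numbers are all assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(high_stakes_acts, low_stakes_acts):
    """Difference of means: an assumed way to derive a 'high-stakes' direction."""
    hi = mean_vector(high_stakes_acts)
    lo = mean_vector(low_stakes_acts)
    return [h - l for h, l in zip(hi, lo)]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; alpha sets the strength."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional activations (made up for illustration).
high = [[1.0, 2.0], [3.0, 4.0]]
low = [[0.0, 1.0], [1.0, 1.0]]
direction = steering_vector(high, low)
steered = steer([0.0, 0.0], direction, alpha=2.0)
```

In a real model the hidden state would be a layer's residual-stream activation, and `alpha` would be tuned: positive values push toward the high-stakes direction, negative values push away from it, and zero leaves the state unchanged.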
