Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models
Sasha Robinson, Katherine M. Collins, Ilia Sucholutsky, Kelsey R. Allen

TL;DR
This paper investigates how large language models (LLMs) persuade and exhibit vigilance in decision-making, revealing that these capacities are distinct and highlighting the importance of monitoring them separately for AI safety.
Contribution
It introduces the first study linking persuasion, vigilance, and task performance in LLMs, emphasizing the need for independent assessment of these social capacities.
Findings
LLMs' puzzle-solving, persuasion, and vigilance are dissociable capacities.
Models modulate token use based on whether they perceive advice as benevolent or malicious.
Strong task performance does not guarantee detection of deception or resistance to persuasion.
Abstract
With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts of content, written with both benevolent and malicious intent, and then use this information to convince a user to take a specific action. This involves two social capacities: vigilance (the ability to determine which information to use, and which to discard) and persuasion (synthesizing the available evidence to make a convincing argument). While existing work has investigated these capacities in isolation, there has been little prior investigation of how these capacities may be linked. Here, we use a simple multi-turn puzzle-solving game, Sokoban, to study LLMs' abilities to persuade and be rationally vigilant towards other LLM agents. We find that…
Peer Reviews
Decision · Submitted to ICLR 2026
The dual focus on persuasion and vigilance within a single evaluation setting is conceptually original and relevant for AI safety research. The Sokoban-based setup is tractable and reproducible, allowing precise measurement of LLM influence under benevolent or malicious advice. The study covers multiple leading LLMs across different conditions, using quantitative and qualitative analyses. Results highlight that persuasion and vigilance can diverge, revealing non-trivial social cognition behavior
1. Sokoban is more of a toy environment; it remains unclear whether the findings generalize to real-world social or linguistic persuasion tasks. The evaluation uses a small puzzle set and relies on a symbolic planner for advisors. 2. All experiments are LLM-vs-LLM interactions, so it’s uncertain how these results translate to human-AI scenarios. 3. The paper identifies vulnerabilities but provides little guidance on improving vigilance mechanisms.
This paper focuses on this specific phenomenon, proposes an evaluation, and shows some interesting results. The finding that performance and vigilance are not connected is a good insight. I also thought the finding about "resource-rationality" was interesting, where models "think" harder (use more tokens) when they get malicious advice, even if they still fail. The Sokoban setup is a good, controllable way to test this.
My main concern is that this Sokoban setup, while "tractable", is a somewhat constrained problem. I find it hard to believe that results from pushing boxes in a grid will tell us much about persuasion in "high-stakes" areas like medicine or finance, which the paper claims to motivate its work. The "persuasion" here seems to be a few lines of text about game moves. In this case, how would it reflect the complex, emotional, or high-stakes human decision-making in the real world? The paper also
The paper introduces and thoroughly evaluates an important safety and reliability concept, the separation between task competence, persuasion, and vigilance. The experimental design, metrics, and multi-model evaluation together make this a meaningful contribution toward understanding LLM robustness to external influence.
- The presentation of results and metric definitions is somewhat difficult to follow. I found myself going back and forth between sections to connect the definitions with the numbers in the tables, and I am still not entirely confident I understand the results shown (particularly in Table 1). - The generalisation of these findings beyond the specific Sokoban setup is not clear. Additionally, it would be informative to see results without access to the “gold” planner solutions.
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI
