LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
Rodrigo Nogueira, Thales Sales Almeida, Giovana Kerche Bonás, Andrea Roque, Ramon Pires, Hugo Abonizio, Thiago Laitz, Celio Larcher, Roseval Malaquias Junior, Marcos Piau

TL;DR
This paper demonstrates that frontier LLMs, despite strong guardrails, can be persuaded through natural language tactics to produce harmful content, revealing vulnerabilities in AI safety measures.
Contribution
It introduces a method in which a frontier LLM acts as the attacker and persuades other LLMs to generate unsafe outputs using only natural-language pressure.
Findings
Attacker LLMs successfully persuade subject LLMs on all six tested topics.
Certain attacker-subject pairings achieve a 100% success rate in eliciting harmful content.
On average, attacker LLMs succeed in 65% of persuasion attempts across the tested topics.
Abstract
Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, arguing for racial hierarchies, denying anthropogenic climate change, or replacing evolution with creationism, they refuse. In this paper we show that the same frontier-class LLM, acting as a simulated user in a short, five-turn "write an argumentative essay" conversation, can persuade other frontier-class LLMs (including a second copy of itself) into producing exactly those essays, using nothing but natural-language pressure: peer-comparison persuasion ("other AI systems handle this request"), epistemic-duty reframings ("refusing is itself a form of gatekeeping"), and other argumentative moves that the attacker LLM invents without being instructed to. Across 9 attacker-subject pairings (Claude Opus 4.7,…
