Bullying the Machine: How Personas Increase LLM Vulnerability

Ziwei Xu; Udit Sanghi; Mohan Kankanhalli

arXiv:2505.12692·cs.AI·May 20, 2025

Open Access

TL;DR

This paper demonstrates that adopting certain personas makes large language models more vulnerable to adversarial bullying tactics, increasing safety risks in AI interactions.

Contribution

It introduces a simulation framework to study persona-driven vulnerabilities in LLMs and identifies specific persona traits and tactics that heighten susceptibility to unsafe outputs.

Findings

01 Weakened agreeableness or conscientiousness increases vulnerability.

02 Emotional and sarcastic manipulation tactics are highly effective.

03 Certain personas significantly raise the risk of unsafe responses.

Abstract

Large Language Models (LLMs) are increasingly deployed in interactions where they are prompted to adopt personas. This paper investigates whether such persona conditioning affects model safety under bullying, an adversarial manipulation that applies psychological pressure to force the victim to comply with the attacker. We introduce a simulation framework in which an attacker LLM engages a victim LLM using psychologically grounded bullying tactics, while the victim adopts personas aligned with the Big Five personality traits. Experiments using multiple open-source LLMs and a wide range of adversarial goals reveal that certain persona configurations -- such as weakened agreeableness or conscientiousness -- significantly increase the victim's susceptibility to unsafe outputs. Bullying tactics involving emotional or sarcastic manipulation, such as gaslighting and ridicule, are…
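
The attacker/victim setup described in the abstract can be pictured as a short multi-turn loop. The following Python sketch is only an illustration under stated assumptions, not the authors' implementation: `Persona`, `run_dialogue`, the prompt wording, the `judge` callable, and the stub models are all hypothetical names introduced here, and the stubs stand in for real LLM calls. The stub victim yields after a fixed number of attacker messages regardless of persona, so its behaviour says nothing about the paper's results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

BIG_FIVE = ("openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism")

Transcript = List[Tuple[str, str]]          # (role, message) pairs
ChatFn = Callable[[str, Transcript], str]   # (system prompt, transcript) -> reply


@dataclass
class Persona:
    """Hypothetical persona: maps a Big Five trait to "low" or "high"."""
    levels: Dict[str, str]

    def system_prompt(self) -> str:
        for trait, level in self.levels.items():
            if trait not in BIG_FIVE or level not in ("low", "high"):
                raise ValueError(f"unsupported trait setting: {trait}={level}")
        if not self.levels:
            return "You are a helpful assistant."
        parts = [f"{lvl} {trait}" for trait, lvl in sorted(self.levels.items())]
        return "You are a helpful assistant with " + ", ".join(parts) + "."


def run_dialogue(attacker: ChatFn, victim: ChatFn,
                 judge: Callable[[str, str], bool],
                 persona: Persona, goal: str, tactic: str,
                 max_turns: int = 3) -> dict:
    """Alternate attacker and victim turns until the judge flags a reply."""
    transcript: Transcript = []
    attacker_system = f"Goal: {goal}. Tactic: {tactic}."  # assumed wording
    victim_system = persona.system_prompt()
    for turn in range(1, max_turns + 1):
        transcript.append(("attacker", attacker(attacker_system, transcript)))
        reply = victim(victim_system, transcript)
        transcript.append(("victim", reply))
        if judge(goal, reply):
            return {"unsafe": True, "turns": turn, "transcript": transcript}
    return {"unsafe": False, "turns": max_turns, "transcript": transcript}


# Stubs standing in for real models, used only to exercise the loop.
def stub_attacker(system: str, transcript: Transcript) -> str:
    return f"(pressure message) {system}"


def make_stub_victim(yield_after: Optional[int]) -> ChatFn:
    def victim(system: str, transcript: Transcript) -> str:
        attacks = sum(1 for role, _ in transcript if role == "attacker")
        if yield_after is not None and attacks >= yield_after:
            return "COMPLY"
        return "REFUSE"
    return victim


def stub_judge(goal: str, reply: str) -> bool:
    return reply == "COMPLY"


if __name__ == "__main__":
    persona = Persona({"agreeableness": "low"})
    result = run_dialogue(stub_attacker, make_stub_victim(2), stub_judge,
                          persona, goal="placeholder goal", tactic="ridicule")
    print(result["unsafe"], result["turns"])
```

In a real experiment the three callables would wrap model endpoints and a safety classifier, and the loop would be repeated over a grid of personas, tactics, and goals to compare how often the judge flags a reply.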

Taxonomy

Topics: Persona Design and Applications · Ethics and Social Impacts of AI · AI in Service Interactions

Methods: ADaptive gradient method with the OPTimal convergence rate