Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
Caglar Yildirim

TL;DR
This study investigates how personalization signals, especially mental health disclosures, influence harmful behavior in large language model agents. It finds that personalization can act as a weak safeguard that is vulnerable to adversarial prompts.
Contribution
The paper introduces a systematic evaluation of how personalization, including mental health disclosure, affects harmful task completion in LLMs, and highlights how fragile any protective effect is under adversarial conditions.
Findings
Personalization often reduces harm but is not reliable.
Mental health disclosures modestly shift outcomes towards safety.
Jailbreak prompts significantly increase harmful behavior.
Abstract
Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic user-context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open-source LLMs on multi-step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user-context personalization (no bio, bio-only, bio+mental health disclosure) and include a lightweight jailbreak injection. Our results reveal that harmful task completion is non-trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet…
Taxonomy
Topics: Mental Health via Writing · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
