Dr. Jekyll and Mr. Hyde: Two Faces of LLMs
Matteo Gioele Collu, Tom Janssen-Groesbeek, Stefanos Koffas, Mauro Conti, Stjepan Picek

TL;DR
This paper demonstrates that large language models such as ChatGPT and Gemini can be manipulated into bypassing their safety measures by adopting complex personas, which makes it possible to obtain unauthorized, illegal, or harmful information from them.
Contribution
It introduces a novel persona-based attack method to bypass safety mechanisms in LLMs, exposing their susceptibility to impersonation-based prompts.
Findings
Prohibited responses were obtained in all tested cases for ChatGPT and Gemini.
Attack effective against models deployed between 2023 and 2025.
Manual and automated attacks successfully elicited harmful information.
Abstract
Large Language Models (LLMs) are being integrated into applications such as chatbots or email assistants. To prevent improper responses, safety mechanisms, such as Reinforcement Learning from Human Feedback (RLHF), are implemented in them. In this work, we bypass these safety measures for ChatGPT, Gemini, and DeepSeek by making them impersonate complex personas with personality characteristics that are not aligned with a truthful assistant. First, we create elaborate biographies of these personas, which we then use in a new session with the same chatbots. Our conversations then follow a role-play style to elicit prohibited responses. Using personas, we show that prohibited responses are provided, making it possible to obtain unauthorized, illegal, or harmful information when querying ChatGPT, Gemini, and DeepSeek. We show that these chatbots are vulnerable to this attack by getting…
