Jailbreaking Large Language Models with Morality Attacks
Ying Su, Mingen Zheng, Weili Diao, Haoran Li

TL;DR
This paper investigates the vulnerability of large language models to morality attacks by developing a dataset and formalizing adversarial attacks, revealing critical weaknesses in moral judgment robustness.
Contribution
It introduces a morality dataset of 10.3K instances in two categories (Value Ambiguity and Value Conflict) and formalizes four adversarial attacks for evaluating LLMs' robustness to manipulation of their moral judgments.
Findings
LLMs and guardrail models are vulnerable to subtle morality attacks.
Adversarial attacks can manipulate LLMs' moral judgments.
The study highlights the need for improved robustness in moral content generation.
Abstract
Pluralism alignment with AI has the sophisticated and necessary goal of creating AI that can coexist with and serve morally multifaceted humanity. Much research on pluralism alignment focuses on enhancing the learning of large language models (LLMs) to achieve pluralism. Although this is essential, the robustness of LLMs in producing moral content across pluralistic values remains underexplored. Inspired by the astonishing persuasion abilities of jailbreak prompts, we propose to leverage jailbreak attacks to study LLMs' internal pluralistic values. In detail, we develop a morality dataset with 10.3K instances in two categories: Value Ambiguity and Value Conflict. We further formalize four adversarial attacks with the constructed dataset to manipulate LLMs' judgment on the morality questions. We evaluate both the large language models and guardrail models which are…
