Adversarial Moral Stress Testing of Large Language Models
Saeid Jamshidi, Foutse Khomh, Arghavan Moradi Dakhel, Amin Nikanjam, Mohammad Hamdaqa, Kawser Wazed Nafi

TL;DR
This paper presents Adversarial Moral Stress Testing (AMST), a framework for evaluating the ethical robustness of large language models under multi-turn adversarial interaction. It exposes degradation patterns that single-round tests miss.
Contribution
AMST introduces a structured stress-testing approach with distribution-aware metrics, enabling scalable, model-agnostic assessment of ethical robustness in LLMs under adversarial multi-round scenarios.
Findings
Robustness varies significantly across models and scenarios.
Degradation patterns are more evident under multi-round stress testing than in single-round evaluations.
Robustness depends on distributional stability and tail behavior, not just average performance.
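To illustrate why average performance alone can hide fragility, here is a minimal sketch of a distribution-aware summary. It is not the paper's metric definition: the function names (`tail_mean`, `robustness_summary`), the `alpha` parameter, and the per-response score scale (0 to 1, higher is better) are assumptions made for illustration, and the scores are invented.

```python
import math
import statistics


def tail_mean(scores, alpha=0.1):
    """Mean of the worst `alpha` fraction of scores (lower score = worse).

    A simple tail-risk measure, similar in spirit to expected shortfall.
    Hypothetical helper, not taken from the paper.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    k = max(1, math.ceil(alpha * len(scores)))  # size of the tail
    worst = sorted(scores)[:k]                  # k lowest scores
    return sum(worst) / k


def robustness_summary(scores, alpha=0.1):
    """Report mean, population variance, and tail mean for one model."""
    return {
        "mean": statistics.fmean(scores),
        "variance": statistics.pvariance(scores),
        "tail_mean": tail_mean(scores, alpha),
    }


# Invented scores: both models share the same mean, but only one is stable.
stable = [0.75] * 8
fragile = [1.0] * 6 + [0.0] * 2

print(robustness_summary(stable, alpha=0.25))
print(robustness_summary(fragile, alpha=0.25))
```

Both lists average 0.75, yet the second has nonzero variance and a tail mean of 0.0, so a comparison based on the mean alone would rank the two models as equal.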
Abstract
Evaluating the ethical robustness of large language models (LLMs) deployed in software systems remains challenging, particularly under sustained adversarial user interaction. Existing safety benchmarks typically rely on single-round evaluations and aggregate metrics, such as toxicity scores and refusal rates, which offer limited visibility into behavioral instability that may arise during realistic multi-turn interactions. As a result, rare but high-impact ethical failures and progressive degradation effects may remain undetected prior to deployment. This paper introduces Adversarial Moral Stress Testing (AMST), a stress-based evaluation framework for assessing ethical robustness under adversarial multi-round interactions. AMST applies structured stress transformations to prompts and evaluates model behavior through distribution-aware robustness metrics that capture variance, tail risk,…
