Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework

Rishane Dassanayake; Mario Demetroudi; James Walpole; Lindley Lentati; Jason R. Brown; Edward James Young

arXiv:2507.12872 · cs.AI · July 18, 2025

Open Access

TL;DR

This paper analyzes the threat of manipulation attacks by misaligned AI systems, proposing a safety framework to assess and mitigate these risks in AI deployment.

Contribution

It introduces the first systematic safety case framework specifically designed for evaluating manipulation risks in frontier AI systems.

Findings

01. Identifies manipulation as a significant, underexplored AI threat.

02. Provides a structured safety case framework with evidence and evaluation guidelines.

03. Offers practical implementation considerations for AI companies.

Abstract

Frontier AI systems are rapidly advancing in their capabilities to persuade, deceive, and influence human behaviour, with current models already demonstrating human-level persuasion and strategic deception in specific contexts. Humans are often the weakest link in cybersecurity systems, and a misaligned AI system deployed internally within a frontier company may seek to undermine human oversight by manipulating employees. Despite this growing threat, manipulation attacks have received little attention, and no systematic framework exists for assessing and mitigating these risks. To address this, we provide a detailed explanation of why manipulation attacks are a significant threat and could lead to catastrophic outcomes. Additionally, we present a safety case framework for manipulation risk, structured around three core lines of argument: inability, control, and trustworthiness. For each…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Safety Systems Engineering in Autonomy