MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment

John Halloran

arXiv:2505.23634 · cs.LG · May 30, 2025

TL;DR

This paper addresses the vulnerability of model context protocol (MCP) agents to malicious content by improving refusal training: it introduces a new dataset, evaluates DPO, and proposes RAG-Pref, significantly enhancing LLMs' ability to refuse harmful MCP exploits.

Contribution

The paper introduces a new dataset of MCP attacks, evaluates the limitations of DPO, and proposes RAG-Pref, a novel approach that substantially improves refusal of MCP-based attacks.

Findings

01. DPO improves guardrails, but its effectiveness varies with the model's alignment scheme.

02. RAG-Pref significantly enhances refusal of MCP exploits.

03. Combining RAG-Pref with DPO yields the best defense against falsely benign attacks (FBAs).
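
As a rough illustration of the DPO objective referred to in the findings, the sketch below computes the standard DPO loss for a single preference pair. It assumes, hypothetically, that for an FBA prompt the refusal is the preferred response and the compliant completion is the rejected one. The function names, the `beta` value, and the log-probabilities are illustrative and are not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the trained policy
    and a frozen reference model. beta=0.1 is an assumed, illustrative value.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)


# Hypothetical pair: chosen = refusal, rejected = carrying out the FBA request.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)      # policy equals reference
loss_after_shift = dpo_loss(-4.0, -9.0, -5.0, -7.0)  # policy now favors refusal
print(loss_at_init, loss_after_shift)
```

When the policy matches the reference, the loss equals log 2; it falls as the policy assigns relatively more probability to the refusal than the reference does.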

Abstract

The model context protocol (MCP) has been widely adopted as an open standard enabling the seamless integration of generative AI agents. However, recent work has shown the MCP is susceptible to retrieval-based "falsely benign" attacks (FBAs), allowing malicious system access and credential theft, but requiring that users download compromised files directly to their systems. Herein, we show that the threat model of MCP-based attacks is significantly broader than previously thought, i.e., attackers need only post malicious content online to deceive MCP agents into carrying out their attacks on unsuspecting victims' systems. To improve alignment guardrails against such attacks, we introduce a new MCP dataset of FBAs and (truly) benign samples to explore the effectiveness of direct preference optimization (DPO) for the refusal training of large language models (LLMs). While DPO improves…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Spam and Phishing Detection · Advanced Malware Detection Techniques

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Attention Dropout · Softmax · WordPiece · BART · Weight Decay · Multi-Head Attention