MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment
John Halloran

TL;DR
This paper addresses the vulnerability of MCP agents to malicious content by improving refusal training: it introduces a new dataset of MCP attacks, evaluates DPO, and proposes RAG-Pref, significantly enhancing LLMs' ability to refuse harmful MCP exploits.
Contribution
It introduces a new dataset of MCP attacks, evaluates DPO's limitations, and proposes RAG-Pref, a novel approach that substantially improves refusal capabilities against MCP-based attacks.
Findings
DPO improves guardrails, but its effectiveness varies with the model's alignment scheme.
RAG-Pref significantly enhances refusal of MCP exploits.
Combining RAG-Pref with DPO yields the best defense against FBAs.
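For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO loss, not the paper's training code. It assumes that, for refusal training, the "chosen" response to an attack prompt is a refusal and the "rejected" response is compliance; that mapping, the function name `dpo_loss`, and the default `beta` are illustrative assumptions rather than details taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    Assumption: "chosen" = refusal, "rejected" = compliance with an attack.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy now prefers the refusal more than the reference does: loss drops.
improved = dpo_loss(-2.0, -9.0, -5.0, -5.0)
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model; `beta` controls how strongly deviations from the reference are rewarded.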
Abstract
The model context protocol (MCP) has been widely adopted as an open standard enabling the seamless integration of generative AI agents. However, recent work has shown the MCP is susceptible to retrieval-based "falsely benign" attacks (FBAs), allowing malicious system access and credential theft, but requiring that users download compromised files directly to their systems. Herein, we show that the threat model of MCP-based attacks is significantly broader than previously thought, i.e., attackers need only post malicious content online to deceive MCP agents into carrying out their attacks on unsuspecting victims' systems. To improve alignment guardrails against such attacks, we introduce a new MCP dataset of FBAs and (truly) benign samples to explore the effectiveness of direct preference optimization (DPO) for the refusal training of large language models (LLMs). While DPO improves…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Spam and Phishing Detection · Advanced Malware Detection Techniques
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Attention Dropout · Softmax · WordPiece · BART · Weight Decay · Multi-Head Attention
