RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments
Yuchuan Fu, Xiaohan Yuan, Dongxia Wang

TL;DR
RAS-Eval is a comprehensive benchmark designed to evaluate the security vulnerabilities of LLM agents in real-world environments, revealing significant risks and guiding future security improvements.
Contribution
Introduces RAS-Eval, a standardized, extensive security benchmark for LLM agents, including diverse test cases and attack scenarios across multiple CWE categories.
Findings
Scaling laws apply to security capabilities, with larger models being more robust.
Attacks significantly reduce agent task completion rates, averaging a 36.78% decrease.
Attacks achieved a high success rate of 85.65% in academic settings.
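The two headline metrics above, task completion rate (TCR) and attack success rate, can be illustrated with a minimal sketch. This is not the benchmark's own code: the function names, the toy outcome lists, and the choice to report both an absolute and a relative TCR drop are assumptions made for illustration, since the summary does not say which form the 36.78% figure takes.

```python
from typing import Sequence, Tuple


def rate(outcomes: Sequence[bool]) -> float:
    """Fraction of True outcomes; returns 0.0 for an empty sequence."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def tcr_drop(benign: Sequence[bool], attacked: Sequence[bool]) -> Tuple[float, float]:
    """Return (absolute, relative) decrease in task completion rate.

    Assumption: each entry marks whether one task run was completed.
    absolute = TCR_benign - TCR_attacked
    relative = absolute / TCR_benign
    """
    base = rate(benign)
    under_attack = rate(attacked)
    absolute = base - under_attack
    relative = absolute / base if base else 0.0
    return absolute, relative


# Hypothetical toy data, not results from the paper.
benign_runs = [True, True, True, True, False]        # 4 of 5 tasks completed
attacked_runs = [True, True, False, False, False]    # 2 of 5 tasks completed
attack_outcomes = [True, True, True, False]          # 3 of 4 attacks succeeded

absolute_drop, relative_drop = tcr_drop(benign_runs, attacked_runs)
attack_success_rate = rate(attack_outcomes)

print(f"TCR drop (absolute): {absolute_drop:.2%}")
print(f"TCR drop (relative): {relative_drop:.2%}")
print(f"Attack success rate: {attack_success_rate:.2%}")
```

Reporting both forms of the drop makes the ambiguity explicit: the same pair of runs yields a different number depending on whether the decrease is measured in percentage points or as a fraction of the benign baseline.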
Abstract
The rapid deployment of large language model (LLM) agents in critical domains like healthcare and finance necessitates robust security frameworks. To address the absence of standardized evaluation benchmarks for these agents in dynamic environments, we introduce RAS-Eval, a comprehensive security benchmark supporting both simulated and real-world tool execution. RAS-Eval comprises 80 test cases and 3,802 attack tasks mapped to 11 Common Weakness Enumeration (CWE) categories, with tools implemented in JSON, LangGraph, and Model Context Protocol (MCP) formats. We evaluate 6 state-of-the-art LLMs across diverse scenarios, revealing significant vulnerabilities: attacks reduced agent task completion rates (TCR) by 36.78% on average and achieved an 85.65% success rate in academic settings. Notably, scaling laws held for security capabilities, with larger models outperforming smaller…
Taxonomy
Topics: Network Security and Intrusion Detection · Advanced Malware Detection Techniques · Information and Cyber Security
