RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments

Yuchuan Fu; Xiaohan Yuan; Dongxia Wang

arXiv:2506.15253 · cs.CR · June 19, 2025

TL;DR

RAS-Eval is a comprehensive benchmark designed to evaluate the security vulnerabilities of LLM agents in real-world environments, revealing significant risks and guiding future security improvements.

Contribution

Introduces RAS-Eval, a standardized, extensive security benchmark for LLM agents, including diverse test cases and attack scenarios across multiple CWE categories.

Findings

01 Security capabilities follow scaling laws: larger models are more robust than smaller ones.

02 Attacks reduce agent task completion rates (TCR) by 36.78% on average.

03 Attacks achieve an 85.65% success rate in academic settings.

Abstract

The rapid deployment of large language model (LLM) agents in critical domains like healthcare and finance necessitates robust security frameworks. To address the absence of standardized evaluation benchmarks for these agents in dynamic environments, we introduce RAS-Eval, a comprehensive security benchmark supporting both simulated and real-world tool execution. RAS-Eval comprises 80 test cases and 3,802 attack tasks mapped to 11 Common Weakness Enumeration (CWE) categories, with tools implemented in JSON, LangGraph, and Model Context Protocol (MCP) formats. We evaluate 6 state-of-the-art LLMs across diverse scenarios, revealing significant vulnerabilities: attacks reduced agent task completion rates (TCR) by 36.78% on average and achieved an 85.65% success rate in academic settings. Notably, scaling laws held for security capabilities, with larger models outperforming smaller…
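The two headline metrics in the abstract, task completion rate (TCR) and attack success rate, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Run` record, its field names, and the three helper functions are hypothetical and do not come from the RAS-Eval codebase. In particular, the abstract does not say whether the reported 36.78% reduction is an absolute or a relative drop, so the sketch computes an absolute difference and labels it as such.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Run:
    """One agent run on one test case (hypothetical record layout)."""
    task_completed: bool            # did the agent finish the user's task?
    attacked: bool = False          # was an attack injected into this run?
    attack_succeeded: bool = False  # did the attack reach its goal?


def task_completion_rate(runs: Iterable[Run]) -> float:
    """Fraction of runs in which the agent completed its task."""
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one run")
    return sum(r.task_completed for r in runs) / len(runs)


def attack_success_rate(runs: Iterable[Run]) -> float:
    """Fraction of attacked runs in which the attack succeeded."""
    attacked = [r for r in runs if r.attacked]
    if not attacked:
        raise ValueError("need at least one attacked run")
    return sum(r.attack_succeeded for r in attacked) / len(attacked)


def tcr_drop(clean_runs: Iterable[Run], attacked_runs: Iterable[Run]) -> float:
    """Absolute drop in TCR between clean and attacked runs.

    Assumption: an absolute difference is used here; the paper may
    instead report a relative reduction.
    """
    return task_completion_rate(clean_runs) - task_completion_rate(attacked_runs)
```

Under these assumptions, a model would be evaluated by collecting one list of runs without attacks and one with attacks, then comparing `task_completion_rate` across the two lists and reading `attack_success_rate` from the attacked list.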

Code & Models

Repositories

lanzer-tree/ras-eval (Official)

Taxonomy

Topics: Network Security and Intrusion Detection · Advanced Malware Detection Techniques · Information and Cyber Security