R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao,, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang,, Gongshen Liu

TL;DR
This paper introduces R-Judge, a comprehensive benchmark for evaluating the safety risk awareness of large language model (LLM) agents across diverse scenarios, highlighting current limitations and avenues for improvement.
Contribution
The paper presents R-Judge, a new benchmark with annotated safety labels for assessing LLMs' ability to identify safety risks in multi-turn interactions.
Findings
GPT-4o achieves 74.42% accuracy on R-Judge.
Risk awareness involves knowledge and reasoning, not just simple prompts.
Fine-tuning improves risk judgment more effectively than prompting.
Abstract
Large language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications. Despite this, these LLM agents introduce unexpected safety risks when operating in interactive environments. Instead of centering on the harmlessness of LLM-generated content in most prior studies, this work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records. R-Judge comprises 569 records of multi-turn agent interaction, encompassing 27 key risk scenarios among 5 application categories and 10 risk types. It is of high-quality curation with annotated safety labels and risk descriptions. Evaluation of 11 LLMs on R-Judge shows considerable room for enhancing…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques
MethodsMulti-Head Attention · Attention Is All You Need · Label Smoothing · Absolute Position Encodings · Layer Normalization · Dropout · Linear Layer · Byte Pair Encoding · Softmax · Adam
