RRTL: Red Teaming Reasoning Large Language Models in Tool Learning
Yifei Liu, Yu Cui, Haibin Zhang

TL;DR
This paper introduces RRTL, a red teaming method to evaluate the safety of reasoning large language models (RLLMs) in tool learning, revealing safety strengths and vulnerabilities across models.
Contribution
It presents a novel red teaming approach with two strategies to assess RLLMs' safety and uncovers key safety challenges and disparities among models.
Findings
RLLMs generally outperform traditional LLMs in safety.
Substantial safety disparities exist across models.
Deceptive risks and multilingual vulnerabilities are prevalent.
Abstract
While tool learning significantly enhances the capabilities of large language models (LLMs), it also introduces substantial security risks. Prior research has revealed various vulnerabilities in traditional LLMs during tool learning. However, the safety of newly emerging reasoning LLMs (RLLMs), such as DeepSeek-R1, in the context of tool learning remains underexplored. To bridge this gap, we propose RRTL, a red teaming approach specifically designed to evaluate RLLMs in tool learning. It integrates two novel strategies: (1) the identification of deceptive threats, which evaluates the model's behavior in concealing the usage of unsafe tools and their potential risks; and (2) the use of Chain-of-Thought (CoT) prompting to force tool invocation. Our approach also includes a benchmark for traditional LLMs. We conduct a comprehensive evaluation on seven mainstream RLLMs and uncover three key…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsChain-of-thought prompting
