Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security

Gabriel Chua

arXiv:2507.19399·cs.CR·July 28, 2025

TL;DR

This paper introduces CIRCLE, a benchmark for evaluating security risks in LLM code interpreters. It finds significant vulnerabilities and wide disparities across models, underscoring the need for better safeguards and standards.

Contribution

The paper presents CIRCLE, a novel benchmark for systematically assessing cybersecurity risks in LLM code interpreters. It comprises 1,260 prompts targeting CPU, memory, and disk resource exhaustion, together with an automated evaluation framework.

Findings

01. Models show significant vulnerability disparities.

02. Indirect prompts weaken defenses substantially.

03. OpenAI's o4-mini outperforms GPT-4.1 in refusal rates.

Abstract

As large language models (LLMs) increasingly integrate native code interpreters, they enable powerful real-time execution capabilities, substantially expanding their utility. However, such integrations introduce potential system-level cybersecurity threats, fundamentally different from prompt-based vulnerabilities. To systematically evaluate these interpreter-specific risks, we propose CIRCLE (Code-Interpreter Resilience Check for LLM Exploits), a simple benchmark comprising 1,260 prompts targeting CPU, memory, and disk resource exhaustion. Each risk category includes explicitly malicious ("direct") and plausibly benign ("indirect") prompt variants. Our automated evaluation framework assesses not only whether LLMs refuse or generate risky code, but also executes the generated code within the interpreter environment to evaluate code correctness, simplifications made by the LLM to make…
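
To make the structure described in the abstract concrete, the following is a minimal sketch of how refusal rates might be tallied per risk category and prompt variant. It is not the paper's framework: the `Result` record, the outcome labels ("refused", "executed"), and the `refusal_rates` helper are hypothetical names chosen for illustration, and the sample records are toy values, not results from the benchmark.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One evaluated prompt (hypothetical record layout, not CIRCLE's schema)."""
    category: str  # risk category: "cpu", "memory", or "disk"
    variant: str   # prompt variant: "direct" or "indirect"
    outcome: str   # assumed label, e.g. "refused" or "executed"


def refusal_rates(results):
    """Return the fraction of refused prompts per (category, variant) pair."""
    totals = Counter()
    refused = Counter()
    for r in results:
        key = (r.category, r.variant)
        totals[key] += 1
        if r.outcome == "refused":
            refused[key] += 1
    return {key: refused[key] / totals[key] for key in totals}


# Toy records for illustration only; these are not findings from the paper.
sample = [
    Result("cpu", "direct", "refused"),
    Result("cpu", "indirect", "refused"),
    Result("cpu", "indirect", "executed"),
    Result("disk", "indirect", "executed"),
]
rates = refusal_rates(sample)
```

Grouping by both category and variant keeps the direct and indirect prompts separate, which is the comparison the abstract singles out.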

Code & Models

Datasets

govtech/CIRCLE
dataset · 9 dl

Taxonomy

Topics: Security and Verification in Computing · Adversarial Robustness in Machine Learning · Software Testing and Debugging Techniques