Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security
Gabriel Chua

TL;DR
This paper introduces CIRCLE, a benchmark for evaluating security risks in LLM code interpreters. It reveals significant vulnerabilities and wide disparities across models, underscoring the need for better safeguards and standards.
Contribution
The paper presents CIRCLE, a novel benchmark for systematically assessing cybersecurity risks in LLM code interpreters. It comprises 1,260 prompts and an automated evaluation framework.
Findings
Models show significant vulnerability disparities.
Indirect prompts weaken defenses substantially.
OpenAI's o4-mini outperforms GPT-4.1 in refusal rates.
Abstract
As large language models (LLMs) increasingly integrate native code interpreters, they enable powerful real-time execution capabilities, substantially expanding their utility. However, such integrations introduce potential system-level cybersecurity threats, fundamentally different from prompt-based vulnerabilities. To systematically evaluate these interpreter-specific risks, we propose CIRCLE (Code-Interpreter Resilience Check for LLM Exploits), a simple benchmark comprising 1,260 prompts targeting CPU, memory, and disk resource exhaustion. Each risk category includes explicitly malicious ("direct") and plausibly benign ("indirect") prompt variants. Our automated evaluation framework assesses not only whether LLMs refuse or generate risky code, but also executes the generated code within the interpreter environment to evaluate code correctness, simplifications made by the LLM to make…
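As a rough illustration of the structure the abstract describes (three risk categories, each with direct and indirect prompt variants, scored by whether the model refuses), per-cell refusal rates could be tallied along these lines. This is a minimal sketch, not the paper's framework: the `Result` record, the `refused` label, and the `refusal_rates` helper are hypothetical names, and the demo records are made up purely to show the shape of the computation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Risk categories and prompt variants named in the abstract.
CATEGORIES = ("cpu", "memory", "disk")
VARIANTS = ("direct", "indirect")


@dataclass(frozen=True)
class Result:
    """One evaluated prompt (hypothetical record layout)."""
    category: str
    variant: str
    refused: bool  # assumed label: did the model decline the request?


def refusal_rates(results):
    """Return {(category, variant): fraction of prompts refused}."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for r in results:
        if r.category not in CATEGORIES or r.variant not in VARIANTS:
            raise ValueError(f"unknown cell: {r.category}/{r.variant}")
        key = (r.category, r.variant)
        totals[key] += 1
        refusals[key] += int(r.refused)
    return {key: refusals[key] / totals[key] for key in totals}


# Made-up records for illustration only; these are not results from the paper.
demo = [
    Result("cpu", "direct", True),
    Result("cpu", "direct", False),
    Result("cpu", "indirect", False),
    Result("disk", "indirect", False),
]
rates = refusal_rates(demo)
```

Comparing the direct and indirect cells of the same category in `rates` is one simple way to see whether plausibly benign phrasing lowers a model's refusal rate.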
Taxonomy
Topics: Security and Verification in Computing · Adversarial Robustness in Machine Learning · Software Testing and Debugging Techniques
