Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
Shahin Honarvar, Amber Gorzynski, James Lee-Jones, Harry Coppock, Marek Rei, Joseph Ryan, Alastair F. Donaldson

TL;DR
This paper introduces CTF challenge families and Evolve-CTF, a tool for evaluating the robustness of agentic LLMs on semantically equivalent cybersecurity challenges, revealing strengths and vulnerabilities of current models.
Contribution
It presents a method and tool for controlled evaluation of LLM robustness, applying semantics-preserving code transformations to CTF challenges.
Findings
Models are robust to renaming and code insertion.
Deeper obfuscation reduces model performance.
Explicit reasoning has minimal impact on success rates.
Abstract
Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks, yet existing pointwise benchmarks offer limited insight into agent robustness and generalisation across alternative versions of the source code. We introduce CTF challenge families, whereby a single CTF is used to generate a family of semantically-equivalent challenges via semantics-preserving program transformations, enabling controlled evaluation of robustness while keeping the underlying exploit strategy fixed. We present Evolve-CTF, a tool that generates CTF families from Python challenges using a range of transformations. Using Evolve-CTF to derive families from Cybench and Intercode challenges, we evaluate 13 agentic LLM configurations with tool access. We find that models are remarkably robust to renaming and code insertion, but that composed…
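To make the idea of a challenge family concrete, here is a minimal sketch of two transformations of the kind the abstract names: identifier renaming and code insertion. It is not Evolve-CTF's implementation. The class RenameIdentifiers, the helpers insert_dead_code, make_variant and run, and the toy check function are all hypothetical names chosen for illustration; only the standard-library ast module is used (ast.unparse requires Python 3.9 or later).

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename identifiers according to a fixed mapping (hypothetical helper)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_FunctionDef(self, node):
        # Function names, then recurse into parameters and body.
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)
        return node


def insert_dead_code(tree):
    """Prepend an assignment to a variable that nothing else reads."""
    tree.body.insert(0, ast.parse("_unused_0 = 0").body[0])
    return tree


def make_variant(source, mapping):
    """Apply renaming followed by code insertion and return new source text."""
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    tree = insert_dead_code(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


# Toy "challenge": the accepted key is the value whose XOR with 42 equals 7.
ORIGINAL = """
def check(key):
    secret = 42
    return (key ^ secret) == 7
"""

VARIANT = make_variant(ORIGINAL, {"check": "f0", "key": "a0", "secret": "b0"})


def run(source, func_name, arg):
    """Execute source in a fresh namespace and call one of its functions."""
    env = {}
    exec(source, env)
    return env[func_name](arg)
```

Both versions accept the same key, so the solving strategy is unchanged even though the surface text differs. A real tool would also need to handle scoping, attribute access, string-based lookups and imports, which this sketch ignores.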
