Taming Scylla: Understanding the multi-headed agentic daemon of the coding seas
Micah Villmow

TL;DR
This paper presents Scylla, a comprehensive evaluation framework for benchmarking agentic coding tools, focusing on how architectural choices impact capability and cost through structured ablation studies and the Cost-of-Pass metric.
Contribution
Introduces a model-agnostic, structured evaluation framework with multiple testing tiers to quantify the effects of architectural complexity on coding tool performance and cost.
Findings
Architectural complexity does not always lead to better quality.
The framework enables reproducible trade-off analysis between complexity and outcomes.
Cost-of-Pass effectively measures efficiency of coding agents.
Abstract
LLM-based tools are automating software development tasks at a rapid pace, but there is no rigorous way to evaluate how different architectural choices -- prompts, skills, tools, multi-agent setups -- materially affect both capability and cost. This paper introduces Scylla, an evaluation framework for benchmarking agentic coding tools through structured ablation studies. It uses seven testing tiers (T0-T6) that progressively add complexity, isolating which choices directly influence results and how. The key metric is Cost-of-Pass (CoP): the expected dollar cost to get one correct solution, which directly quantifies the trade-off between complexity and efficiency. The framework is model-agnostic and designed to work with any CLI tool; this paper demonstrates it with Claude Sonnet 4.5, using multiple LLM judges (Opus 4.5, Sonnet 4.5, Haiku 4.5) from the same vendor for evaluation consensus,…
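To make the Cost-of-Pass definition concrete, here is a minimal sketch in Python. It assumes CoP is computed as the mean dollar cost per attempt divided by the pass rate, which is one common reading of "expected dollar cost to get one correct solution"; the abstract does not give the exact formula Scylla uses. The `TierResult` type, the `cost_of_pass` function, and all numbers below are hypothetical and serve only as illustration; they are not results from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class TierResult:
    """Aggregated runs for one testing tier (hypothetical structure)."""
    tier: str              # e.g. "T0" .. "T6"
    total_cost_usd: float  # total dollars spent across all attempts
    attempts: int          # number of runs in this tier
    passes: int            # number of runs judged correct


def cost_of_pass(result: TierResult) -> float:
    """Expected dollars per correct solution under the assumed formula.

    CoP = (mean cost per attempt) / (pass rate).
    Returns infinity when no attempt passed.
    """
    if result.attempts <= 0:
        raise ValueError("attempts must be positive")
    if result.passes == 0:
        return math.inf
    mean_cost = result.total_cost_usd / result.attempts
    pass_rate = result.passes / result.attempts
    return mean_cost / pass_rate


# Made-up numbers for illustration only, not data from the paper.
results = [
    TierResult("T0", total_cost_usd=2.0, attempts=10, passes=5),
    TierResult("T3", total_cost_usd=9.0, attempts=10, passes=6),
]

for r in results:
    print(f"{r.tier}: CoP = ${cost_of_pass(r):.2f}")
```

In this made-up comparison the more complex tier passes more often yet costs more per correct solution, which is the kind of trade-off the metric is meant to expose.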
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Advanced Software Engineering Methodologies
