ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?
Ayush Nangia, Shikhar Mishra, Aman Gokrani, Paras Chopra

TL;DR
ISO-Bench is a new benchmark that evaluates how effectively coding agents optimize real-world inference workloads, combining execution-based and LLM-based metrics for a more complete assessment.
Contribution
The paper introduces ISO-Bench, a benchmark of 54 tasks drawn from vLLM and SGLang, and shows that both combined (execution-based and LLM-based) metrics and agent scaffolding matter when evaluating coding agents.
Findings
No single agent dominates across codebases.
Agents often identify bottlenecks but fail to implement solutions.
Scaffolding significantly impacts agent performance.
Abstract
We introduce ISO-Bench, a benchmark that tests coding agents on real-world inference optimization tasks. The tasks are taken from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task gives an agent a codebase and a bottleneck description, and the agent must produce an optimization patch, which is evaluated against expert human solutions. We curated 54 tasks from merged pull requests with measurable performance improvements. Existing benchmarks rely heavily on runtime-based metrics, but such approaches can be gamed to pass tests without capturing the actual intent of the code changes. We therefore combine hard (execution-based) and soft (LLM-based) metrics and show that both are necessary for complete evaluation. Evaluating both closed and open-source coding agents, we find that no single agent dominates across codebases. Surprisingly,…
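To make the hard/soft distinction concrete, here is a minimal Python sketch of how an execution-based signal and an LLM-based signal might be combined into one verdict per patch. It is an illustration only: this summary does not describe ISO-Bench's actual scoring, so the `PatchResult` fields, the `classify` function, the outcome labels, and the 1.05x speedup threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    """Signals collected for one agent-produced patch (all fields hypothetical)."""
    tests_passed: bool       # hard signal: the patched code passes its tests
    baseline_seconds: float  # hard signal: runtime before the patch
    patched_seconds: float   # hard signal: runtime after the patch
    intent_match: bool       # soft signal: an LLM judge says the patch targets
                             # the described bottleneck, as the expert fix does


def speedup(result: PatchResult) -> float:
    """Return baseline runtime divided by patched runtime."""
    if result.patched_seconds <= 0:
        raise ValueError("patched_seconds must be positive")
    return result.baseline_seconds / result.patched_seconds


def classify(result: PatchResult, min_speedup: float = 1.05) -> str:
    """Combine both signals; the threshold is an assumed value, not the paper's."""
    hard_ok = result.tests_passed and speedup(result) >= min_speedup
    if hard_ok and result.intent_match:
        return "true_success"  # faster, and for the intended reason
    if hard_ok:
        return "hard_only"     # faster, but misses the intended change
    if result.intent_match:
        return "soft_only"     # right bottleneck, no working speedup
    return "failure"
```

Under these assumptions, `hard_only` corresponds to the gaming risk the abstract attributes to runtime-only metrics, and `soft_only` corresponds to the finding that agents often identify a bottleneck but fail to implement a solution. Neither case is visible when only one kind of metric is used.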
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Artificial Intelligence in Games
