NetArena: Dynamic Benchmarks for AI Agents in Network Automation
Yajie Zhou, Jiajun Ruan, Eric S. Wang, Sadjad Fouladi, Francis Y. Yan, Kevin Hsieh, Zaoxing Liu

TL;DR
NetArena is a dynamic benchmarking framework for AI agents in network automation. It improves the statistical reliability of evaluation, exposes fine-grained agent behaviors, and supports fine-tuning, addressing the limitations of static benchmarks in complex, real-world network environments.
Contribution
We introduce NetArena, a novel dynamic benchmark generation framework that generalizes across diverse network tasks and enables real-time, reliable evaluation of AI agents.
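To make the idea of on-demand query generation concrete, here is a minimal sketch, not NetArena's actual interface. It assumes a toy network state (a set of undirected links), two hypothetical actions (`add` and `remove`), and a made-up safety rule (no node may lose all its links). Every name in it (`Action`, `Query`, `generate_query`, `evaluate`, `reference_agent`) is illustrative. The ground truth comes from replaying the sampled actions, standing in for the emulator, and the report carries the three metrics named in the abstract: correctness, safety, and latency. Unlike the paper's setup, safety is checked only on the final state here.

```python
import random
import time
from dataclasses import dataclass


def link(a, b):
    """Canonical form of an undirected link (hypothetical helper)."""
    return tuple(sorted((a, b)))


@dataclass(frozen=True)
class Action:
    op: str      # "add" or "remove"
    edge: tuple  # canonical link, see link()


def apply(state, action):
    """Return the new state (a frozenset of links) after one action."""
    if action.op == "add":
        return state | {action.edge}
    return state - {action.edge}


def is_safe(state, nodes):
    """Toy safety rule (an assumption): every node keeps at least one link."""
    connected = {n for edge in state for n in edge}
    return all(n in connected for n in nodes)


@dataclass
class Query:
    initial: frozenset
    actions: list
    expected: frozenset  # ground truth obtained by replaying the actions


def generate_query(nodes, num_actions, rng):
    """Sample a fresh query: a ring topology plus random link toggles."""
    initial = frozenset(
        link(nodes[i], nodes[(i + 1) % len(nodes)]) for i in range(len(nodes))
    )
    state, actions = initial, []
    for _ in range(num_actions):
        a, b = rng.sample(nodes, 2)
        edge = link(a, b)
        action = Action("remove" if edge in state else "add", edge)
        state = apply(state, action)
        actions.append(action)
    return Query(initial, actions, state)


def evaluate(agent, query, nodes):
    """Run the agent and report correctness, safety, and wall-clock latency."""
    start = time.perf_counter()
    result = agent(query.initial, query.actions)
    latency = time.perf_counter() - start
    return {
        "correct": result == query.expected,
        "safe": is_safe(result, nodes),
        "latency_s": latency,
    }


def reference_agent(initial, actions):
    """Stand-in for an AI agent: applies every action faithfully."""
    state = initial
    for action in actions:
        state = apply(state, action)
    return state


nodes = ["a", "b", "c", "d"]
query = generate_query(nodes, num_actions=5, rng=random.Random(0))
print(evaluate(reference_agent, query, nodes))
```

Because each query is sampled from a seeded generator rather than read from a fixed file, a new seed yields a new, previously unseen query, which is the property that a static benchmark lacks.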
Findings
Reduces confidence-interval overlap from 85% to 0%.
Agents achieve 13-38% performance on large-scale queries.
Exposes fine-grained behaviors missed by static benchmarks.
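The first finding concerns overlapping confidence intervals. As a rough illustration of why more queries help, the sketch below computes normal-approximation (Wald) 95% intervals for two hypothetical agents with 60% and 70% accuracy. The accuracies, sample sizes, and function names are assumptions chosen for the example, not results or methods from the paper, and the paper may define its overlap metric differently.

```python
import math


def wald_interval(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a success rate."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return (max(0.0, p - half_width), min(1.0, p + half_width))


def intervals_overlap(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (correct answers for agent A, correct answers for agent B) per sample size;
# both rows correspond to 60% vs. 70% accuracy.
scenarios = {50: (30, 35), 2000: (1200, 1400)}

for n, (correct_a, correct_b) in scenarios.items():
    ci_a = wald_interval(correct_a, n)
    ci_b = wald_interval(correct_b, n)
    print(n, ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

With 50 queries the two intervals overlap, so the agents cannot be ranked with confidence. With 2,000 queries the intervals separate, because the interval width shrinks in proportion to one over the square root of the number of queries.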
Abstract
As AI agents expand into high-stakes domains like network system operations, evaluating their real-world reliability becomes increasingly critical. However, existing benchmarks risk contamination due to static design, show high statistical variance from limited dataset size, and fail to reflect the complexity of production environments. We present NetArena, a dynamic benchmark generation framework for network applications. NetArena introduces a novel abstraction and unified interface that generalize across diverse tasks, enabling dynamic benchmarking despite the heterogeneity of network workloads. At runtime, users can generate unlimited queries on demand. NetArena integrates with network emulators to measure correctness, safety, and latency during execution. We demonstrate NetArena on three representative applications and find that (1) NetArena significantly improves statistical…
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper effectively solves data contamination risk through dynamic generation, eliminates statistical unreliability of small datasets, and captures real-world complexity missing in existing benchmarks.
2. It integrates with production-grade emulators (Mininet, Kubernetes), and provides execution-grounded assessment beyond simple correctness, including safety and latency metrics.
3. It supports 9,250+ queries with unlimited generation, while maintaining diversity across complexity levels an
1. Limited agent diversity: The evaluation only includes baseline prompting strategies (CoT, Few-shot, ReAct), which may not fully represent the capabilities of advanced LLM-based agents in network reasoning tasks.
2. The integration with high-fidelity emulators may introduce significant setup challenges, potentially reducing the reproducibility and accessibility of the framework.
3. While correctness, safety, and latency are meaningful metrics, the evaluation could be enriched with additional d
- Clear unified state–action abstraction that works across three concrete network apps (DC capacity planning, Mininet routing, K8s policy troubleshooting), not just a toy demo.
- Dynamic, on-demand query generation with stochastic sampling and emulator-backed ground truth, explicitly to cut contamination and widen coverage.
- Execution-time evaluation on correctness, safety, and latency inside real emulators (Mininet, K8s, DC simulator), which exposes failure modes that static, correctness-only b
- RL/SFT “use cases” are proof-of-concept and on small models (Qwen2.5-0.5B, limited SFT splits), so the “can be used for rl training” claim is ahead of the evidence.
- All results are still in three networking-style environments; claims of generality beyond these domains are argued but not empirically shown.
- The dynamic generation relies on hand-designed templates and app-specific state equivalence/safety checks; portability to other operators’ emulators may be non-trivial.
1. The paper introduces a dynamic LLM benchmark generation framework specifically for the networking domain, demonstrating clear innovation.
2. Beyond traditional correctness metrics, the benchmark incorporates safety and latency as key evaluation dimensions, which better align with the needs of high-stakes systems.
3. The paper is well written and clearly presented.
1. Although the paper defines safety and latency evaluation standards, it lacks explicit quantitative formulas or threshold specifications.
2. The evaluation focuses on three types of network tasks, but broader validation across more diverse scenarios is missing. The authors could further discuss potential directions for future evaluation (additional experiments are not necessary).
3. While correctness, safety, and latency often involve trade-offs, the paper does not provide corresponding quanti
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
