AI for Distributed Systems Design: Scalable Cloud Optimization Through Repeated LLMs Sampling And Simulators
Jacopo Tagliabue

TL;DR
This paper presents a method that combines large language models with a domain-specific simulator to iteratively design and verify distributed-systems policies, with the goal of improving scalability and performance.
Contribution
It introduces a generate-and-verify framework that pairs LLMs with a simulator for scalable distributed-systems policy design, and reports preliminary throughput improvements.
Findings
Preliminary throughput gains across multiple models
Framework preserves interpretability and targeted search
Discussion on scaling and future directions
Abstract
We explore AI-driven distributed-systems policy design by combining stochastic code generation from large language models (LLMs) with deterministic verification in a domain-specific simulator. Using a Function-as-a-Service runtime (Bauplan) and its open-source simulator (Eudoxia) as a case study, we frame scheduler design as an iterative generate-and-verify loop: an LLM proposes a Python policy, the simulator evaluates it on standardized traces, and structured feedback steers subsequent generations. This setup preserves interpretability while enabling targeted search over a large design space. We detail the system architecture and report preliminary results on throughput improvements across multiple models. Beyond early gains, we discuss the limits of the current setup and outline next steps; in particular, we conjecture that AI will be crucial for scaling this methodology by helping to…
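The generate-and-verify loop described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `propose_policy` is a stub standing in for the LLM call, `simulate` is a toy single-worker simulator rather than Eudoxia, and the candidate policies, trace format, and throughput metric are hypothetical. None of it reflects the actual Bauplan or Eudoxia APIs.

```python
import random
from typing import Callable, Dict, List, Tuple

Job = Tuple[int, int]                      # (arrival_time, duration), hypothetical trace format
Policy = Callable[[List[Job]], List[Job]]  # a policy returns the order in which to run jobs

# Hypothetical candidate pool standing in for LLM-generated Python policies.
CANDIDATES: Dict[str, Policy] = {
    "fifo": lambda jobs: sorted(jobs, key=lambda j: j[0]),
    "longest_first": lambda jobs: sorted(jobs, key=lambda j: -j[1]),
    "shortest_first": lambda jobs: sorted(jobs, key=lambda j: j[1]),
}

def propose_policy(feedback: List[Tuple[str, int]], rng: random.Random) -> str:
    """Stub for the LLM: sample a candidate that the feedback shows is untried."""
    tried = {name for name, _ in feedback}
    untried = [name for name in CANDIDATES if name not in tried]
    return rng.choice(untried or list(CANDIDATES))

def simulate(policy: Policy, trace: List[Job], horizon: int) -> int:
    """Toy deterministic simulator: count jobs one worker finishes within `horizon`.

    Simplification: all jobs are treated as available at time 0.
    """
    clock, done = 0, 0
    for _, duration in policy(list(trace)):
        if clock + duration > horizon:
            break
        clock += duration
        done += 1
    return done

def generate_and_verify(trace: List[Job], horizon: int, rounds: int, seed: int = 0):
    """Propose a policy, score it in the simulator, and feed the score back."""
    rng = random.Random(seed)
    feedback: List[Tuple[str, int]] = []
    for _ in range(rounds):
        name = propose_policy(feedback, rng)
        score = simulate(CANDIDATES[name], trace, horizon)
        feedback.append((name, score))     # structured feedback for the next round
    best = max(feedback, key=lambda item: item[1])
    return best, feedback

if __name__ == "__main__":
    trace = [(0, 5), (1, 1), (2, 3), (3, 1), (4, 2)]
    best, history = generate_and_verify(trace, horizon=6, rounds=3)
    print("best policy:", best)
    print("history:", history)
```

The stochastic step (proposal) and the deterministic step (simulation) are kept in separate functions, so the same trace always yields the same score for a given policy, and only the order of proposals depends on the seed.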
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Software Engineering Methodologies · Machine Learning in Materials Science
