Evaluation-Driven Development and Operations of LLM Agents: A Process Model and Reference Architecture
Boming Xia, Qinghua Lu, Liming Zhu, Zhenchang Xing, Dehai Zhao, Hao Zhang

TL;DR
This paper introduces EDDOps, a process model and reference architecture for continuous, evaluation-driven development and operation of LLM agents, addressing the limitations of static evaluation methods.
Contribution
It presents a novel, empirically derived framework that integrates ongoing evaluation into the lifecycle of LLM agents, enabling safer and more adaptable systems.
Findings
Developed a process model for evaluation-driven LLM agent development.
Created a reference architecture supporting continuous evaluation.
Demonstrated improved safety and adaptability in LLM agent deployment.
Abstract
Large Language Models (LLMs) have enabled the emergence of LLM agents, systems capable of pursuing under-specified goals and adapting after deployment. Evaluating such agents is challenging because their behavior is open-ended, probabilistic, and shaped by system-level interactions over time. Traditional evaluation methods, built around fixed benchmarks and static test suites, fail to capture emergent behaviors or support continuous adaptation across the lifecycle. To ground a more systematic approach, we conduct a multivocal literature review (MLR) synthesizing academic and industrial evaluation practices. The findings directly inform two empirically derived artifacts: a process model and a reference architecture that embed evaluation as a continuous, governing function rather than a terminal checkpoint. Together they constitute the evaluation-driven development and operations (EDDOps)…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Business Process Modeling and Analysis · Digital Rights Management and Security
