Build, Judge, Optimize: A Blueprint for Continuous Improvement of Multi-Agent Consumer Assistants

Alejandro Breen Herrera; Aayush Sheth; Steven G. Xu; Zhucheng Zhan; Charles Wright; Marcus Yearwood; Hongtai Wei; Sudeep Das; Danny Nightingale; Meg Watson; Charles Pollnow V

arXiv:2603.03565·cs.AI·May 4, 2026

TL;DR

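This paper presents a practical blueprint for evaluating and optimizing multi-agent conversational shopping assistants, illustrated through a production-scale AI grocery assistant. It targets two challenges that surface when moving from prototype to production: evaluating multi-turn interactions and optimizing tightly coupled multi-agent systems.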

Contribution

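The paper introduces a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions, an LLM-as-judge pipeline calibrated against human annotations, and two prompt-optimization strategies for multi-agent systems: Sub-agent GEPA and MAMuT GEPA.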

Findings

01

Developed a structured evaluation rubric for conversational shopping assistants.

02

Created a calibrated LLM-as-judge pipeline aligned with human annotations.

03

Proposed two prompt-optimization strategies: Sub-agent GEPA and MAMuT GEPA.

Abstract

Conversational shopping assistants (CSAs) represent a compelling application of agentic AI, but moving from prototype to production reveals two underexplored challenges: how to evaluate multi-turn interactions and how to optimize tightly coupled multi-agent systems. Grocery shopping further amplifies these difficulties, as user requests are often underspecified, highly preference-sensitive, and constrained by factors such as budget and inventory. In this paper, we present a practical blueprint for evaluating and optimizing conversational shopping assistants, illustrated through a production-scale AI grocery assistant. We introduce a multi-faceted evaluation rubric that decomposes end-to-end shopping quality into structured dimensions and develop a calibrated LLM-as-judge pipeline aligned with human annotations. Building on this evaluation foundation, we investigate two complementary…
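As a rough illustration of what aligning an LLM judge with human annotations can involve, the sketch below computes Cohen's kappa (chance-corrected agreement) between human and judge labels for each rubric dimension. This is not the paper's pipeline: the dimension names, the pass/fail label set, and the function names `cohen_kappa` and `agreement_by_dimension` are all assumptions made for the example, and the abstract does not state which agreement metric the authors use.

```python
from collections import Counter


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(human)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


def agreement_by_dimension(human_labels, judge_labels):
    """Map each rubric dimension to the kappa between human and judge labels."""
    return {
        dim: cohen_kappa(human_labels[dim], judge_labels[dim])
        for dim in human_labels
    }


# Hypothetical annotations: dimension names and labels are illustrative only.
human_labels = {
    "relevance": ["pass", "pass", "fail", "fail"],
    "budget_adherence": ["pass", "fail", "pass", "fail"],
}
judge_labels = {
    "relevance": ["pass", "fail", "fail", "fail"],
    "budget_adherence": ["pass", "fail", "pass", "fail"],
}

scores = agreement_by_dimension(human_labels, judge_labels)
```

Reporting agreement per dimension rather than as a single pooled number makes it easier to see which part of a rubric the judge handles poorly, so that only that dimension's judging prompt needs revisiting.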
