PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows
Kazuki Kawamura, Satoshi Waki, Kei Tateno

TL;DR
PROTEA is a unified interface that enables offline evaluation, targeted refinement, and iterative improvement of multi-agent LLM workflows by localizing bottlenecks and supporting prompt revisions.
Contribution
The paper introduces PROTEA, a system for offline testing and refinement of multi-agent workflows, with backward node evaluation and graph visualizations for debugging.
Findings
PROTEA improved document-inspection accuracy from 64.3% to 83.9%.
PROTEA increased recommendation Hit@5 from 0.30 to 0.38.
Participants valued graph localization and prompt editing features.
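For readers unfamiliar with the recommendation metric above, here is a minimal sketch of Hit@k under its common definition: a query counts as a hit if at least one relevant item appears in the top k recommendations, and the reported value is the mean over queries. This summary does not state the paper's exact definition, so treat this as an assumption; the function names `hit_at_k` and `mean_hit_at_k` are illustrative and do not come from PROTEA.

```python
from typing import Hashable, Sequence, Set


def hit_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 5) -> float:
    """Return 1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def mean_hit_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
    k: int = 5,
) -> float:
    """Average Hit@k over all queries (assumed definition; 0.0 if there are no queries)."""
    scores = [hit_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition, a Hit@5 of 0.38 would mean that 38% of test queries had at least one relevant item among the top five recommendations.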
Abstract
Multi-agent LLM workflows -- systems composed of multiple role-specific LLM calls -- often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed…
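The execute, score, and localize loop described in the abstract can be illustrated with a small sketch. This is not PROTEA's implementation: it assumes a linear chain of nodes rather than a general graph, uses plain Python callables as stand-ins for LLM calls, and treats a rubric as a function returning a score in [0, 1]. The names `Node`, `run_and_score`, and `first_bottleneck`, and the 0.5 threshold, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    name: str
    run: Callable[[str], str]       # stand-in for a role-specific LLM call
    rubric: Callable[[str], float]  # hypothetical rubric: output -> score in [0, 1]


def run_and_score(nodes: List[Node], user_input: str) -> Dict[str, float]:
    """Run the nodes in order, feeding each output to the next, and score every output."""
    scores: Dict[str, float] = {}
    value = user_input
    for node in nodes:
        value = node.run(value)
        scores[node.name] = node.rubric(value)
    return scores


def first_bottleneck(
    nodes: List[Node], scores: Dict[str, float], threshold: float = 0.5
) -> Optional[str]:
    """Return the earliest node whose score falls below the threshold, if any."""
    for node in nodes:
        if scores[node.name] < threshold:
            return node.name
    return None


# Toy workflow: the second node truncates its input and fails its length rubric.
workflow = [
    Node("extract", str.strip, lambda out: 1.0 if out else 0.0),
    Node("summarize", lambda s: s[:3], lambda out: 1.0 if len(out) >= 10 else 0.0),
]
node_scores = run_and_score(workflow, "  some document text  ")
suspect = first_bottleneck(workflow, node_scores)
```

Reporting the earliest low-scoring node reflects the abstract's point that errors in intermediate outputs propagate downstream, so the first failing node is a natural place to start revising prompts.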
