FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems
Mingxuan Hui, Xinyue Li, Lu Wang, Chengcheng Wan, Yifan Wang, Yimian Wang, Feiyue Song, Beining Shi, Yixi Li, Yaxiao Li

TL;DR
FLARE is a new testing framework for multi-agent LLM systems that uses coverage-guided fuzzing to detect failures, outperforming baselines and uncovering new issues.
Contribution
FLARE introduces a testing approach for MAS that extracts specifications from agent definitions and applies coverage-guided fuzzing, addressing the limitations of traditional testing methods.
Findings
Achieves 96.9% inter-agent coverage and 91.1% intra-agent coverage.
Outperforms baseline methods by 9.5% (inter-agent coverage) and 1.0% (intra-agent coverage).
Uncovers 56 previously unknown failures.
Abstract
Multi-Agent LLM Systems (MAS) have been adopted to automate complex human workflows by breaking down tasks into subtasks. However, due to the non-deterministic behavior of LLM agents and the intricate interactions between agents, MAS applications frequently encounter failures, including infinite loops and failed tool invocations. Traditional software testing techniques are ineffective in detecting such failures due to the lack of LLM agent specification, the large behavioral space of MAS, and semantic-based correctness judgment. This paper presents FLARE, a novel testing framework tailored for MAS. FLARE takes the source code of MAS as input and extracts specifications and behavioral spaces from agent definitions. Based on these specifications, FLARE builds test oracles and conducts coverage-guided fuzzing to expose failures. It then analyzes execution logs to judge whether each test…
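To make the idea of coverage-guided fuzzing against an extracted specification concrete, here is a minimal Python sketch. It is not FLARE's implementation. The toy system (run_toy_mas), the agent names, the handoff graph in SPEC_EDGES, the mutation operator, and the loop oracle are all hypothetical stand-ins, and "coverage" here simply means the fraction of allowed agent-to-agent handoffs that have been exercised, which is one plausible reading of inter-agent coverage.

```python
import random

# Hypothetical specification: the agent-to-agent handoffs the toy system allows.
SPEC_EDGES = {
    ("planner", "coder"),
    ("planner", "researcher"),
    ("researcher", "coder"),
    ("coder", "reviewer"),
    ("reviewer", "coder"),
    ("reviewer", "done"),
}
MAX_STEPS = 12          # assumed cutoff: longer runs count as a suspected loop
ALPHABET = "ab?#!"      # characters the mutator may insert


def run_toy_mas(task: str) -> list:
    """Stand-in for running a real MAS: maps a task string to an agent trace."""
    trace = ["planner"]
    if "?" in task:
        trace += ["researcher", "coder"]
    else:
        trace.append("coder")
    trace.append("reviewer")
    # Injected bug: "!" makes the reviewer reject (almost) forever.
    rework_rounds = 50 if "!" in task else task.count("#")
    for _ in range(rework_rounds):
        if len(trace) > MAX_STEPS:
            return trace  # cut off before reaching "done"
        trace += ["coder", "reviewer"]
    trace.append("done")
    return trace


def edges_of(trace: list) -> set:
    """Spec edges exercised by one execution trace."""
    return set(zip(trace, trace[1:])) & SPEC_EDGES


def is_failure(trace: list) -> bool:
    """Toy oracle: a run that never reaches 'done' is flagged as a failure."""
    return trace[-1] != "done"


def mutate(task: str, rng: random.Random) -> str:
    """Insert or delete one character at a random position."""
    if not task or rng.random() < 0.5:
        pos = rng.randrange(len(task) + 1)
        return task[:pos] + rng.choice(ALPHABET) + task[pos:]
    pos = rng.randrange(len(task))
    return task[:pos] + task[pos + 1:]


def fuzz(seeds: list, iterations: int, rng: random.Random):
    """Keep a mutated input in the corpus only if it covers a new spec edge."""
    corpus = list(seeds)
    covered = set()
    failures = []
    for seed in corpus:
        covered |= edges_of(run_toy_mas(seed))
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        trace = run_toy_mas(candidate)
        if is_failure(trace):
            failures.append(candidate)
        new_edges = edges_of(trace) - covered
        if new_edges:
            covered |= new_edges
            corpus.append(candidate)
    return covered, failures


if __name__ == "__main__":
    covered, failures = fuzz(["ab"], 300, random.Random(0))
    print(f"edge coverage: {len(covered)}/{len(SPEC_EDGES)}")
    print(f"failing inputs found: {len(failures)}")
```

The design choice worth noting is the feedback loop: an input survives only when it exercises a handoff no earlier input reached, so the corpus drifts toward unexplored parts of the behavioral space. A real MAS would replace run_toy_mas with actual agent execution and would need a semantic oracle rather than a simple termination check.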
