MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming

Chengqi Zheng; Jianda Chen; Yueming Lyu; Wen Zheng Terence Ng; Haopeng Zhang; Yew-Soon Ong; Ivor Tsang; Haiyan Yin

arXiv:2505.22967·cs.LG·May 30, 2025

TL;DR

MermaidFlow introduces a safety-constrained, graph-based evolutionary framework for generating robust, verifiable workflows in autonomous agent reasoning, significantly improving success rates and convergence speed without altering task protocols.

Contribution

It redefines workflow generation by integrating safety constraints and structured graph evolution, enhancing robustness and interpretability in agentic reasoning systems.

Findings

01

Improves success rates of workflow generation

02

Achieves faster convergence to executable plans

03

Provides a scalable, modular framework for agent reasoning

Abstract

Despite the promise of autonomous agentic reasoning, existing workflow generation methods frequently produce fragile, unexecutable plans due to unconstrained LLM-driven construction. We introduce MermaidFlow, a framework that redefines the agentic search space through safety-constrained graph evolution. At its core, MermaidFlow represents workflows as a verifiable intermediate representation using Mermaid, a structured and human-interpretable graph language. We formulate domain-aware evolutionary operators, i.e., crossover, mutation, insertion, and deletion, to preserve semantic correctness while promoting structural diversity, enabling efficient exploration of a high-quality, statically verifiable workflow space. Without modifying task settings or evaluation protocols, MermaidFlow achieves consistent improvements in success rates and faster convergence to executable plans on the agent…
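To make the idea of a statically verifiable workflow graph concrete, here is a minimal Python sketch. It is not the authors' implementation: the node names, the `parse_edges`, `is_valid`, `insert_node`, `delete_node`, and `evolve_step` helpers, and the specific validity rule (acyclic, with every node on a path from a source to a sink) are all assumptions made for illustration. The parser only understands bare `a --> b` lines, which is a small subset of Mermaid flowchart syntax.

```python
import re
from collections import deque

# Hypothetical workflow in Mermaid flowchart syntax; node names are illustrative.
MERMAID = """
flowchart TD
    problem --> generate
    generate --> review
    review --> answer
"""

EDGE = re.compile(r"^\s*(\w+)\s*-->\s*(\w+)\s*$")


def parse_edges(text):
    """Extract (src, dst) pairs from simple `a --> b` lines; other lines are skipped."""
    return [m.groups() for m in map(EDGE.match, text.splitlines()) if m]


def reach(adj, start):
    """Return the set of nodes reachable from `start` in adjacency map `adj`."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_valid(edges, source="problem", sink="answer"):
    """Assumed static check: acyclic, and every node lies on a source-to-sink path."""
    nodes = {n for e in edges for n in e}
    if source not in nodes or sink not in nodes:
        return False
    succ = {n: [] for n in nodes}
    pred = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        pred[b].append(a)
        indeg[b] += 1
    # Kahn's algorithm: every node is removed only if the graph has no cycle.
    queue = deque(n for n in nodes if indeg[n] == 0)
    removed = 0
    while queue:
        n = queue.popleft()
        removed += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    if removed != len(nodes):
        return False
    return reach(succ, source) == nodes and reach(pred, sink) == nodes


def insert_node(edges, src, dst, new):
    """Insertion operator (sketch): split edge src->dst into src->new->dst."""
    if (src, dst) not in edges:
        raise ValueError(f"no edge {src} --> {dst}")
    return [e for e in edges if e != (src, dst)] + [(src, new), (new, dst)]


def delete_node(edges, node):
    """Deletion operator (sketch): drop `node` and bridge its predecessors to its successors."""
    preds = [a for a, b in edges if b == node]
    succs = [b for a, b in edges if a == node]
    kept = [e for e in edges if node not in e]
    bridged = [(a, b) for a in preds for b in succs]
    return list(dict.fromkeys(kept + bridged))  # de-duplicate, keep order


def evolve_step(edges, candidate):
    """Accept a candidate only if it passes the static check; otherwise keep the parent."""
    return candidate if is_valid(candidate) else edges
```

A candidate produced by any operator is only adopted when `is_valid` accepts it, so an edit that introduces a cycle or a dangling node is discarded before anything is executed. Crossover and mutation could be gated by the same check.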

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. **Clear formalization**: The graph representation of workflows, type system, and evolutionary operators are well-defined
2. **Static verification mechanism**: Two-layer checking (soft + hard) ensures syntactic correctness of generated workflows
3. **Comprehensive experiments**: Cover multiple domains including math reasoning and code generation, with comparisons against multiple baselines
4. **Token efficiency**: Approximately 50% reduction in token cost compared to AFlow
5. **Case study**: F

Weaknesses

**1. Limited Novelty**
- **Essentially a variant of known paradigms**: The method still follows a per-task iterative evolutionary search paradigm, introducing new representation and constraints on top of existing frameworks.
- **Questionable necessity of Mermaid**: The paper does not sufficiently justify why Mermaid is superior to Python. For LLMs, both are structured text, but LLMs have higher affinity for code representations. Moreover, current LLM-based workflow generation is far from the st

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- **Novel Representation**: The use of Mermaid as a declarative intermediate representation is innovative and well-motivated. It cleanly separates planning from execution, addressing a key weakness in prior code-based methods.
- **Comprehensive System Design**: The paper thoroughly describes the type system, operators, validation mechanisms (soft and hard checks), and the complete pipeline from Mermaid to executable code.
- **Consistent Empirical Improvements**: The method shows improvements acr

Weaknesses

- **Mermaid DSL Limitations**: The Mermaid DSL is static by design, which may make it hard to express loops, conditionals, or some runtime operations. The paper does not show how these limitations are addressed. Extending the DSL or documenting its expressive limits would strengthen the contribution.
- **Presentation Issues**: Figures 1 and 2 appear as raster images rather than vector graphics (such as PDF or SVG) and lose clarity when zoomed.
- **Execution Model Ablations**: The paper uses only gpt-4o-mini as the execution

Reviewer 03 · Rating 6 · Confidence 3

Strengths

* Solid theoretical formalization.
* Innovative integration of evolutionary search and multi-agent optimization.
* Well-structured proofs and explicit assumptions.
* Broad empirical evaluation demonstrating clear advantages.
* Clear, consistent, and professional presentation.

Weaknesses

* **Incomplete experimental reporting**: Missing per-benchmark hyperparameters and statistical variance. *Suggestion:* Add full configuration tables and multi-run results.
* **Unclear curriculum mechanism**: Difficulty-level staging lacks quantitative definitions. *Suggestion:* Provide formal thresholds and curriculum ablations.
* **Assumptions not empirically tested**: Theoretical premises like positive information gain remain unchecked. *Suggestion:* Visualize empirical distributions of

Code & Models

Repositories

chengqiarchy/mermaidflow · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Machine Learning in Materials Science · Scientific Computing and Data Management