MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming
Chengqi Zheng, Jianda Chen, Yueming Lyu, Wen Zheng Terence Ng, Haopeng Zhang, Yew-Soon Ong, Ivor Tsang, Haiyan Yin

TL;DR
MermaidFlow introduces a safety-constrained, graph-based evolutionary framework for generating robust, verifiable workflows for autonomous agent reasoning. It reports consistent improvements in success rates and faster convergence to executable plans, without changing task settings or evaluation protocols.
Contribution
It redefines workflow generation by integrating safety constraints and structured graph evolution, enhancing robustness and interpretability in agentic reasoning systems.
Findings
Improves success rates of workflow generation
Achieves faster convergence to executable plans
Provides a scalable, modular framework for agent reasoning
Abstract
Despite the promise of autonomous agentic reasoning, existing workflow generation methods frequently produce fragile, unexecutable plans due to unconstrained LLM-driven construction. We introduce MermaidFlow, a framework that redefines the agentic search space through safety-constrained graph evolution. At its core, MermaidFlow represents workflows as a verifiable intermediate representation using Mermaid, a structured and human-interpretable graph language. We formulate domain-aware evolutionary operators, i.e., crossover, mutation, insertion, and deletion, to preserve semantic correctness while promoting structural diversity, enabling efficient exploration of a high-quality, statically verifiable workflow space. Without modifying task settings or evaluation protocols, MermaidFlow achieves consistent improvements in success rates and faster convergence to executable plans on the agent…
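To make the idea of a statically verifiable workflow space more concrete, here is a minimal Python sketch of one way a typed workflow graph, a static check, and an insertion operator could fit together. It is an illustration under stated assumptions, not the paper's implementation: the `Workflow` class, the `verify` and `insert_node` functions, the node names, and the type labels (`"text"`, `"list"`, `"none"`) are all hypothetical, and the checks shown (dangling edges, type agreement along edges, acyclicity) are stand-ins for whatever rules MermaidFlow actually enforces.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    # Hypothetical representation: node name -> (input type, output type).
    nodes: dict[str, tuple[str, str]] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)

    def to_mermaid(self) -> str:
        """Serialize the graph as a Mermaid flowchart."""
        lines = ["flowchart TD"]
        for src, dst in self.edges:
            lines.append(f"    {src} --> {dst}")
        return "\n".join(lines)


def verify(wf: Workflow) -> list[str]:
    """Return a list of violations; an empty list means the graph passes."""
    errors = []
    for src, dst in wf.edges:
        if src not in wf.nodes or dst not in wf.nodes:
            errors.append(f"dangling edge {src} --> {dst}")
        elif wf.nodes[src][1] != wf.nodes[dst][0]:
            errors.append(f"type mismatch on {src} --> {dst}")

    # Acyclicity check with Kahn's algorithm over edges between known nodes.
    valid = [(s, d) for s, d in wf.edges if s in wf.nodes and d in wf.nodes]
    indegree = {name: 0 for name in wf.nodes}
    for _, dst in valid:
        indegree[dst] += 1
    ready = [name for name, deg in indegree.items() if deg == 0]
    visited = 0
    while ready:
        current = ready.pop()
        visited += 1
        for src, dst in valid:
            if src == current:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    if visited != len(wf.nodes):
        errors.append("cycle detected")
    return errors


def insert_node(wf, edge, name, in_type, out_type):
    """Split `edge` with a new node; return the child only if it verifies."""
    if edge not in wf.edges or name in wf.nodes:
        return None
    src, dst = edge
    child = Workflow(dict(wf.nodes), [e for e in wf.edges if e != edge])
    child.nodes[name] = (in_type, out_type)
    child.edges += [(src, name), (name, dst)]
    return child if not verify(child) else None


parent = Workflow(
    nodes={
        "problem": ("none", "text"),
        "solve": ("text", "text"),
        "answer": ("text", "none"),
    },
    edges=[("problem", "solve"), ("solve", "answer")],
)

# A type-compatible insertion is accepted.
child = insert_node(parent, ("solve", "answer"), "review", "text", "text")
print(child.to_mermaid())

# An insertion whose input type does not match its predecessor is rejected.
rejected = insert_node(parent, ("solve", "answer"), "vote", "list", "text")
print(rejected)
```

In this sketch the operator never returns an invalid child, so every candidate that reaches evaluation has already passed the static check. That is the general property the abstract attributes to constrained operators; how MermaidFlow achieves it may differ.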
Peer Reviews
Decision: Submitted to ICLR 2026
1. **Clear formalization**: The graph representation of workflows, type system, and evolutionary operators are well-defined
2. **Static verification mechanism**: Two-layer checking (soft + hard) ensures syntactic correctness of generated workflows
3. **Comprehensive experiments**: Cover multiple domains including math reasoning and code generation, with comparisons against multiple baselines
4. **Token efficiency**: Approximately 50% reduction in token cost compared to AFlow
5. **Case study**: F
**1. Limited Novelty**
- **Essentially a variant of known paradigms**: The method still follows a per-task iterative evolutionary search paradigm, introducing new representation and constraints on top of existing frameworks.
- **Questionable necessity of Mermaid**: The paper does not sufficiently justify why Mermaid is superior to Python. For LLMs, both are structured text, but LLMs have higher affinity for code representations. Moreover, current LLM-based workflow generation is far from the st
- **Novel Representation**: The use of Mermaid as a declarative intermediate representation is innovative and well-motivated. It cleanly separates planning from execution, addressing a key weakness in prior code-based methods.
- **Comprehensive System Design**: The paper thoroughly describes the type system, operators, validation mechanisms (soft and hard checks), and the complete pipeline from Mermaid to executable code.
- **Consistent Empirical Improvements**: The method shows improvements acr
- **Mermaid DSL Limitations**: The Mermaid DSL is static by design, which may make loops, conditionals, and some runtime operations hard to express. The paper does not show how it handles these limitations. Extending the DSL or documenting its expressive limits would strengthen the contribution.
- **Presentation Issues**: Figures 1 and 2 appear as raster images rather than vector graphics (such as PDF or SVG) and lose clarity when zoomed.
- **Execution Model Ablations**: The paper uses only gpt-4o-mini as the execution
* Solid theoretical formalization.
* Innovative integration of evolutionary search and multi-agent optimization.
* Well-structured proofs and explicit assumptions.
* Broad empirical evaluation demonstrating clear advantages.
* Clear, consistent, and professional presentation.
* **Incomplete experimental reporting**: Missing per-benchmark hyperparameters and statistical variance. *Suggestion:* Add full configuration tables and multi-run results.
* **Unclear curriculum mechanism**: Difficulty-level staging lacks quantitative definitions. *Suggestion:* Provide formal thresholds and curriculum ablations.
* **Assumptions not empirically tested**: Theoretical premises like positive information gain remain unchecked. *Suggestion:* Visualize empirical distributions of
Taxonomy
Topics: Multimodal Machine Learning Applications · Machine Learning in Materials Science · Scientific Computing and Data Management
