MermaidSeqBench: An Evaluation Benchmark for NL-to-Mermaid Sequence Diagram Generation
Basel Shbita, Farhan Ahmed, Chad DeLuca

TL;DR
MermaidSeqBench is a new benchmark designed to evaluate large language models' ability to generate accurate Mermaid sequence diagrams from natural language, addressing a key gap in assessing model correctness for software engineering tasks.
Contribution
The paper introduces a human-verified, synthetically extended benchmark and an LLM-as-a-judge evaluation method for assessing diagram generation quality.
Findings
State-of-the-art LLMs show significant capability gaps in diagram generation.
The benchmark enables fine-grained evaluation of syntax correctness, activation handling, and error handling.
Initial evaluations demonstrate the benchmark's effectiveness in revealing model limitations.
Abstract
Large language models (LLMs) have shown great promise in generating structured diagrams from natural language descriptions, particularly Mermaid sequence diagrams for software engineering. However, the lack of existing benchmarks to assess the LLM's correctness on this task hinders the reliable deployment of these models in production environments. To address this shortcoming, we introduce MermaidSeqBench, a human-verified and LLM-synthetically-extended benchmark for assessing LLM capabilities in generating Mermaid sequence diagrams from natural language prompts. The benchmark consists of 132 samples developed via a hybrid methodology of human-verified flows, LLM-based augmentation, and rule-based expansion. The evaluation uses an LLM-as-a-judge model to assess generation across various fine-grained metrics such as syntax correctness, activation handling, error handling, and practical…
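To make the activation-handling metric concrete, here is a minimal illustrative sketch of a rule-based check on a Mermaid sequence diagram. It is not the paper's evaluation method, which relies on an LLM judge. The function name `check_diagram`, the `MESSAGE` pattern, and the sample diagrams are hypothetical, and the check covers only a small subset of Mermaid syntax (the `sequenceDiagram` header, explicit `activate`/`deactivate` lines, and the `+`/`-` arrow shortcuts).

```python
import re

# Arrow tokens from Mermaid sequence diagram syntax, longest first so that
# "-->>" is not mistaken for "->".
ARROWS = ["-->>", "->>", "-->", "--x", "--)", "->", "-x", "-)"]
ARROW_RE = "|".join(re.escape(a) for a in ARROWS)

# Matches "Sender<arrow>[+|-]Receiver: text". A "+" activates the receiver,
# a "-" deactivates the sender.
MESSAGE = re.compile(
    rf"^(?P<src>[\w ]+?)\s*(?P<arrow>{ARROW_RE})(?P<mark>[+-]?)\s*(?P<dst>[\w ]+?)\s*:"
)


def check_diagram(text):
    """Return a list of problems found; an empty list means the checks passed."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    problems = []
    if not lines or lines[0] != "sequenceDiagram":
        problems.append("missing 'sequenceDiagram' header")

    depth = {}  # participant name -> number of currently open activations
    for ln in lines[1:]:
        parts = ln.split()
        if parts[0] in ("activate", "deactivate") and len(parts) >= 2:
            target = " ".join(parts[1:])
            delta = 1 if parts[0] == "activate" else -1
        else:
            m = MESSAGE.match(ln)
            if not m or not m["mark"]:
                continue  # not a message, or a message without a shortcut
            if m["mark"] == "+":
                target, delta = m["dst"], 1
            else:
                target, delta = m["src"], -1
        depth[target] = depth.get(target, 0) + delta
        if depth[target] < 0:
            problems.append(f"'{target}' deactivated while not active")
            depth[target] = 0

    for name, open_count in depth.items():
        if open_count > 0:
            problems.append(f"'{name}' is still active at end of diagram")
    return problems


GOOD = """
sequenceDiagram
    participant Alice
    participant Bob
    Alice->>+Bob: Request
    Bob-->>-Alice: Response
"""

BAD = """
sequenceDiagram
    Alice->>+Bob: Request
    Bob-->>Alice: Response
"""

if __name__ == "__main__":
    print(check_diagram(GOOD))  # []
    print(check_diagram(BAD))   # ["'Bob' is still active at end of diagram"]
```

A check like this can catch only mechanical errors. Judging whether a diagram faithfully reflects the natural language prompt is the kind of assessment the benchmark delegates to an LLM judge.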
