A System Model Generation Benchmark from Natural Language Requirements
Dongming Jin, Zhi Jin, Linyu Li, Zheng Fang, Jia Li, Xiaohong Chen

TL;DR
This paper introduces SysMBench, a benchmark dataset with 151 scenarios for evaluating large language models' ability to generate system models from natural language requirements, highlighting current limitations.
Contribution
The paper presents SysMBench and SysMEval, the first benchmark and evaluation metric for assessing LLMs in system model generation from natural language.
Findings
LLMs perform poorly on the benchmark: the highest BLEU score is only 4%.
Measured with the SysMEval metric, the highest F1 score is 62%.
The benchmark and evaluation framework are publicly released for future research.
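To make the F1 figure above concrete, here is a minimal sketch of how an F1 score can be computed from the overlap between the elements of a generated model and those of a reference model. It assumes exact string matching, and the element names and the `f1_score` helper are hypothetical. It is not SysMEval's actual scoring procedure, which this summary does not describe.

```python
def f1_score(generated: set[str], reference: set[str]) -> float:
    """Harmonic mean of precision and recall over exactly matching elements."""
    if not generated or not reference:
        return 0.0
    matched = len(generated & reference)   # elements present in both models
    precision = matched / len(generated)   # share of generated elements that are correct
    recall = matched / len(reference)      # share of reference elements that were recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical model elements, for illustration only.
reference = {"part engine", "part wheel", "port fuelIn", "action start"}
generated = {"part engine", "part wheel", "action stop"}

score = f1_score(generated, reference)
print(f"F1 = {score:.2f}")
```

Here two of the three generated elements match the four reference elements, so precision is 2/3 and recall is 1/2. The harmonic mean penalizes a model that does well on one but poorly on the other.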
Abstract
System models, a critical artifact in software development, provide a formal abstraction of both the structural and behavioral aspects of software systems, which can facilitate early requirements analysis and architecture design. However, developing system models remains challenging due to the specific syntax of model description languages and the relative scarcity of public model examples. While large language models (LLMs) have shown promise in generating code in programming languages and could potentially aid in system model development, no benchmarks currently exist for evaluating their ability to generate system models in specific description languages. We present SysMBench, which comprises 151 human-curated scenarios spanning a wide range of popular domains and varying difficulty levels. Each scenario mainly comprises a natural language requirements description, a system…
