Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation
Minxiao Li, Shuying Yan, Li Zhang, Yang Liu, Fang Liu

TL;DR
This paper introduces R2ABench, a new benchmark dataset and hybrid evaluation framework for assessing requirement-to-architecture generation by LLMs, revealing their strengths and limitations in structural reasoning.
Contribution
The paper provides the first comprehensive dataset and multi-dimensional evaluation framework for LLM-based generation of software architecture from requirements.
Findings
LLMs excel at syntactic validity and entity extraction.
They struggle with relational reasoning, leading to fragmented architectures.
Code-specialized models help but do not fully solve the relational reasoning issue.
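One way to make "fragmented" concrete is to count the connected components of the generated diagram's relation graph: a diagram whose entities are all present but barely linked splits into many components. The sketch below is an illustration only, not the paper's metric. The function name `count_components` and the sample entity names are hypothetical, and relations are treated as undirected.

```python
def count_components(entities, relations):
    """Count connected components, treating each relation as an undirected edge."""
    parent = {e: e for e in entities}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in relations:
        # Skip relations that mention an entity that was never declared.
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    return len({find(e) for e in entities})


# Hypothetical model output: four entities extracted, only one relation.
entities = ["OrderService", "PaymentService", "Inventory", "Notifier"]
relations = [("OrderService", "PaymentService")]
fragments = count_components(entities, relations)
print(fragments)  # 3: one linked pair plus two isolated entities
```

A fully connected architecture yields 1, so higher counts indicate that more of the diagram is disconnected.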
Abstract
Recently, Large Language Models (LLMs) have demonstrated significant potential in automating software engineering tasks. Generating software architecture designs from requirement documents is a crucial step in software development. However, there is currently a lack of functional datasets tailored for this task. To bridge this gap, we introduce R2ABench (Requirement-To-Architecture Benchmark), a novel benchmark comprising diverse real-world software projects paired with comprehensive Product Requirements Documents (PRDs) and expert-curated PlantUML reference diagrams. Furthermore, we propose a multi-dimensional, hybrid evaluation framework that assesses generated diagrams across three complementary layers: Structural Graph Metrics, Multi-dimensional Scoring, and Architecture Anti-pattern Detection. Using this framework, we conducted a comprehensive empirical study evaluating…
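The abstract does not spell out how the Structural Graph Metrics layer is computed. As a rough sketch of what such a layer could look like, the code below parses a small subset of PlantUML and scores entities and relations separately against a reference diagram using precision, recall, and F1. All names (`parse_plantuml`, `prf`), the regular expressions, and the toy diagrams are assumptions made for illustration. Relations are compared as undirected pairs, which is a simplification.

```python
import re

# Assumed subset of PlantUML: "class X", "component X", "interface X" declarations
# and binary relations such as "A --> B", "A ..> B", "A <|-- B".
ENTITY_RE = re.compile(r"^\s*(?:class|component|interface)\s+(\w+)")
RELATION_RE = re.compile(r"^\s*(\w+)\s+([<|*o]*[-.]+[>|*o]*)\s+(\w+)")


def parse_plantuml(text):
    """Return (entities, relations) found in a PlantUML snippet."""
    entities, relations = set(), set()
    for line in text.splitlines():
        m = ENTITY_RE.match(line)
        if m:
            entities.add(m.group(1))
            continue
        m = RELATION_RE.match(line)
        if m:
            # Undirected pair: arrow direction is ignored in this sketch.
            relations.add(frozenset((m.group(1), m.group(3))))
    return entities, relations


def prf(predicted, reference):
    """Precision, recall, and F1 of a predicted set against a reference set."""
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


REFERENCE = """@startuml
class OrderService
class PaymentService
class Inventory
OrderService --> PaymentService
OrderService --> Inventory
@enduml"""

GENERATED = """@startuml
class OrderService
class PaymentService
class Inventory
OrderService --> PaymentService
@enduml"""

ref_entities, ref_relations = parse_plantuml(REFERENCE)
gen_entities, gen_relations = parse_plantuml(GENERATED)
entity_scores = prf(gen_entities, ref_entities)
relation_scores = prf(gen_relations, ref_relations)
print("entities:", entity_scores)
print("relations:", relation_scores)
```

In this toy pair the generated diagram declares every reference entity but omits one relation, so the entity F1 is 1.0 while relation recall drops to 0.5. Scoring the two sets separately is what lets an evaluation tell entity extraction apart from relational reasoning.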
