Reliability of Large Language Models for Design Synthesis: An Empirical Study of Variance, Prompt Sensitivity, and Method Scaffolding
Rabia Iftikhar, Andreas Rausch

TL;DR
This empirical study evaluates the reliability of large language models in software design synthesis, focusing on variance, prompt sensitivity, and the impact of method scaffolding on design quality.
Contribution
The paper introduces a preference-based few-shot prompting approach and systematically assesses its effectiveness across multiple models and prompts in design synthesis tasks.
Findings
Preference-based prompting improves adherence to design principles.
Model behavior significantly affects design reliability.
Non-determinism persists despite alignment techniques.
Abstract
Large Language Models (LLMs) are increasingly applied to automate software engineering tasks, including the generation of UML class diagrams from natural language descriptions. While prior work demonstrates that LLMs can produce syntactically valid diagrams, syntactic correctness alone does not guarantee meaningful design. This study investigates whether LLMs can move beyond diagram translation to perform design synthesis, and how reliably they maintain design-oriented reasoning under variation. We introduce a preference-based few-shot prompting approach that biases LLM outputs toward designs satisfying object-oriented principles and pattern-consistent structures. Two design-intent benchmarks, each with three domain-only, paraphrased prompts and 10 repeated runs, are used to evaluate three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash) across three modeling strategies:…
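The evaluation setup described in the abstract can be pictured as a grid of generation runs. The sketch below is a minimal illustration, not the authors' harness. It assumes a full factorial design (every model crossed with every benchmark, prompt, strategy, and run), which the abstract implies but does not state outright. The model names come from the abstract; the benchmark, prompt, and strategy identifiers are hypothetical placeholders, since the abstract does not name them in the text available here.

```python
from itertools import product

# Model names as given in the abstract.
MODELS = ["ChatGPT 4o-mini", "Claude 3.5 Sonnet", "Gemini 2.5 Flash"]

# Hypothetical placeholder identifiers (assumptions): the abstract gives
# only the counts, not the names, of benchmarks, prompts, and strategies.
BENCHMARKS = ["benchmark_1", "benchmark_2"]
PROMPTS = ["paraphrase_1", "paraphrase_2", "paraphrase_3"]
STRATEGIES = ["strategy_1", "strategy_2", "strategy_3"]
RUNS = range(1, 11)  # 10 repeated runs per configuration


def evaluation_grid():
    """Return one record per generation, assuming a full factorial design."""
    return [
        {"model": m, "benchmark": b, "prompt": p, "strategy": s, "run": r}
        for m, b, p, s, r in product(MODELS, BENCHMARKS, PROMPTS, STRATEGIES, RUNS)
    ]


grid = evaluation_grid()
print(len(grid))  # number of generations under the full-factorial assumption
```

Under that assumption, grouping the records by (model, benchmark, prompt, strategy) yields the sets of 10 repeated runs over which output variance would be measured, while grouping by (model, benchmark, strategy) exposes sensitivity to prompt paraphrasing.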
