Evaluating Large Language Models for Detecting Architectural Decision Violations
Ruoyu Su, Alexander Bakhtin, Noman Ahmad, Matteo Esposito, Valentina Lenarduzzi, Davide Taibi

TL;DR
This study evaluates how effectively Large Language Models detect architectural decision violations in open-source software. It finds them strong on explicit decisions and weaker on decisions that are implicit or depend on organizational knowledge.
Contribution
It demonstrates the potential and current limitations of LLMs in automating architectural decision validation, providing a multi-model pipeline and comprehensive analysis.
Findings
LLMs show substantial agreement and accuracy on explicit, code-inferable decisions.
Accuracy decreases for implicit or deployment-dependent decisions.
LLMs can support but not replace human judgment in architectural decision validation.
Abstract
Architectural Decision Records (ADRs) play a central role in maintaining software architecture quality, yet many decision violations go unnoticed because projects lack both systematic documentation and automated detection mechanisms. Recent advances in Large Language Models (LLMs) open up new possibilities for automating architectural reasoning at scale. We investigated how effectively LLMs can identify decision violations in open-source systems by examining their agreement, accuracy, and inherent limitations. Our study analyzed 980 ADRs across 109 GitHub repositories using a multi-model pipeline in which one LLM performs the primary screening of potential decision violations and three additional LLMs independently validate the reasoning. We assessed agreement, accuracy, precision, and recall, and complemented the quantitative findings with expert evaluation. The models achieved substantial agreement…
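The evaluation described in the abstract (three validator models, agreement, accuracy, precision, recall) can be illustrated with a small sketch. This is not the authors' pipeline: the excerpt does not name the agreement statistic, so Fleiss' kappa is an assumption here, as is aggregating the validators by majority vote. The verdicts and expert labels below are made-up toy data, and every function name is hypothetical.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for items that each received the same number of ratings.

    `ratings` is a list of per-item label lists, e.g. one verdict per validator.
    Assumption: the paper's agreement measure is not stated in the excerpt.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()   # label counts over all items
    p_items = []         # per-item observed agreement
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        squares = sum(c * c for c in counts.values())
        p_items.append((squares - n_raters) / (n_raters * (n_raters - 1)))
    p_observed = sum(p_items) / n_items
    p_expected = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_expected == 1.0:
        return 1.0       # every rating identical: treat as perfect agreement
    return (p_observed - p_expected) / (1 - p_expected)

def majority_vote(votes):
    """Label chosen by most validators (True means 'violation')."""
    return Counter(votes).most_common(1)[0][0]

def precision_recall_accuracy(predicted, expected):
    """Binary precision, recall, and accuracy against reference labels."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p and e)
    fp = sum(1 for p, e in pairs if p and not e)
    fn = sum(1 for p, e in pairs if not p and e)
    tn = sum(1 for p, e in pairs if not p and not e)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy

# Toy data (hypothetical): three validator verdicts per candidate violation,
# plus one expert label per candidate.
validator_verdicts = [
    [True, True, True],
    [False, False, False],
    [True, True, False],
    [False, False, False],
]
expert_labels = [True, False, False, False]

kappa = fleiss_kappa(validator_verdicts)
consensus = [majority_vote(v) for v in validator_verdicts]
precision, recall, accuracy = precision_recall_accuracy(consensus, expert_labels)
```

Majority voting needs an odd number of validators to avoid ties in the binary case, which the three-validator setup satisfies. Other aggregation rules, such as requiring unanimity, would trade recall for precision.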
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Software System Performance and Reliability
