Trust at Your Own Peril: A Mixed Methods Exploration of the Ability of Large Language Models to Generate Expert-Like Systems Engineering Artifacts and a Characterization of Failure Modes

Taylan G. Topcu; Mohammed Husain; Max Ofsa; Paul Wach

arXiv:2502.09690 · cs.CL · February 17, 2025

TL;DR

This study evaluates the ability of large language models to generate systems engineering artifacts comparable to human experts, revealing they can produce similar outputs but also exhibit critical failure modes that pose risks to adoption.

Contribution

It provides a baseline assessment of LLMs' capabilities in systems engineering tasks without fine-tuning, highlighting their potential and limitations through both quantitative and qualitative analysis.

Findings

01 LLMs can generate artifacts similar to human benchmarks when prompted carefully.

02 Quantitative analysis shows LLM outputs are often indistinguishable from expert-created artifacts.

03 The study identifies critical failure modes such as premature requirements, unsubstantiated estimates, and overspecification.

Abstract

Multi-purpose Large Language Models (LLMs), a subset of generative Artificial Intelligence (AI), have recently made significant progress. Expectations for LLMs to assist with systems engineering (SE) tasks are high; however, the interdisciplinary and complex nature of systems, along with the need to synthesize deep domain knowledge and operational context, raises questions about the efficacy of LLMs in generating SE artifacts, particularly given that they are trained on data that is broadly available on the internet. To that end, we present results from an empirical exploration in which a human expert-generated SE artifact was taken as a benchmark, parsed, and fed into various LLMs through prompt engineering to generate segments of typical SE artifacts. This procedure was applied without any fine-tuning or calibration to document baseline LLM performance. We then adopted a two-fold…

Taxonomy

Topics: Occupational Health and Safety Research · Risk and Safety Analysis · Software Engineering Techniques and Practices