Validating Formal Specifications with LLM-generated Test Cases

Alcino Cunha; Nuno Macedo

arXiv:2510.23350·cs.SE·February 19, 2026

Validating Formal Specifications with LLM-generated Test Cases

Alcino Cunha, Nuno Macedo

PDF

TL;DR

This paper evaluates the use of GPT-5 and other LLMs to automatically generate test cases from natural language requirements for formal specifications, demonstrating high effectiveness in detecting specification errors.

Contribution

It introduces an empirical approach to automate test case generation for formal specifications using state-of-the-art LLMs, focusing on Alloy models and validating their effectiveness.

Findings

01

GPT-5 generates syntactically correct test cases effectively.

02

LLMs can detect many human-written specification errors.

03

Other LLMs also show promising results.

Abstract

Validation is a central activity when developing formal specifications. Similarly to coding, a possible validation technique is to define upfront test cases or scenarios that a future specification should satisfy or not. Unfortunately, specifying such test cases is burdensome and error prone, which could cause users to skip this validation task. This paper reports the results of an empirical evaluation of using pre-trained large language models (LLMs) to automate the generation of test cases from natural language requirements. In particular, we focus on test cases for structural requirements of simple domain models formalized in the Alloy specification language. Our evaluation focuses on the state-of-the-art GPT-5 model, but results from other closed- and open-source LLMs are also reported. The results show that, in this context, GPT-5 is already quite effective at generating positive…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.