# From Law to Gherkin: A Human-Centred Quasi-Experiment on the Quality of LLM-Generated Behavioural Specifications from Food-Safety Regulations

**Authors:** Shabnam Hassani, Mehrdad Sabetzadeh, Daniel Amyot

arXiv: 2508.20744 · 2026-03-12

## TL;DR

This study evaluates how well large language models (Claude and Llama) generate Gherkin behavioural specifications from food-safety regulations. Human evaluators rated the generated specifications highly on all five criteria, but occasional omissions and hallucinations mean that human oversight remains necessary.

## Contribution

The paper presents the first systematic human-subject evaluation of LLMs' ability to derive Gherkin behavioural specifications from legal texts, showing both their potential and their limitations.

## Key findings

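- Ratings fell in the top two categories for most assessments on all five criteria: relevance 95%, clarity 100%, completeness 94.2%, singularity 93.4%, and time savings 91.7%
- No statistically reliable differences across participants or between the two LLMs (Claude and Llama)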
- Human review remains essential due to occasional omissions and hallucinations

## Abstract

Context: Laws and regulations increasingly shape software design, development, and quality assurance in regulated domains. Because legal provisions are written in technology-neutral language, deriving concrete specifications, requirements, and acceptance criteria to verify software compliance is difficult and error-prone. Recent advances in generative AI, especially large language models (LLMs), may help automate this process.

Objective: We present the first systematic human-subject evaluation of LLMs' ability to derive Gherkin behavioural specifications from legal texts using a quasi-experimental design. Gherkin is a domain-specific language for scenario-based system behaviour descriptions in Given-When-Then form and is well suited to automation in software development.

Methods: Ten participants evaluated 60 Gherkin specifications generated from food-safety regulations by Claude and Llama. Each participant assessed 12 specifications across five criteria: relevance, clarity, completeness, singularity, and time savings. Each specification was evaluated by two participants, yielding 120 assessments with quantitative ratings and qualitative feedback.

Results: Ratings were uniformly high in the top two categories: relevance 95%, clarity 100%, completeness 94.2%, singularity 93.4%, and time savings 91.7%. No statistically reliable differences were found across participants or between LLMs. Qualitative feedback noted occasional omissions, hallucinations, and mixed intents, underscoring the need for human oversight, especially in safety-critical domains.

Conclusion: In food safety, LLMs can assist in deriving Gherkin specifications from legal texts, but omissions and hallucinations require systematic human review.
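
As a rough illustration of two points in the abstract, the sketch below (i) holds a small Gherkin scenario in Given-When-Then form and checks its step order, and (ii) reproduces the assessment count from the study design. The scenario text, the helper names `step_keywords` and `follows_given_when_then`, and the variable names are all hypothetical and written for this summary; none of them come from the paper or its generated specifications.

```python
# Hypothetical scenario written for illustration only; it is NOT one of the
# 60 specifications evaluated in the paper.
SCENARIO = """\
Feature: Cold storage of perishable food
  Scenario: Storage temperature exceeds the permitted limit
    Given a storage unit holding perishable food
    When the recorded temperature rises above the permitted limit
    Then the system raises a non-compliance alert
"""

STEP_KEYWORDS = ("Given", "When", "Then", "And", "But")


def step_keywords(text):
    """Return the leading step keyword of every step line, in order."""
    found = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        first_word = stripped.split()[0]
        if first_word in STEP_KEYWORDS:
            found.append(first_word)
    return found


def follows_given_when_then(text):
    """Check that a single scenario has exactly Given, When, Then in order.

    'And'/'But' continuation steps are ignored. This is a simplification:
    it does not handle multiple scenarios or repeated primary keywords.
    """
    primary = [k for k in step_keywords(text) if k in ("Given", "When", "Then")]
    return primary == ["Given", "When", "Then"]


# Study design figures as stated in the abstract.
participants = 10
specs_per_participant = 12
specifications = 60
raters_per_specification = 2

# Both ways of counting should give the same number of assessments.
assessments_by_participant = participants * specs_per_participant
assessments_by_specification = specifications * raters_per_specification

if __name__ == "__main__":
    print(step_keywords(SCENARIO))
    print(follows_given_when_then(SCENARIO))
    print(assessments_by_participant, assessments_by_specification)
```

The two counts agree at 120, matching the number of assessments reported in the abstract. The structural check is deliberately minimal; a real pipeline would use a proper Gherkin parser, and it says nothing about the relevance or completeness that the human evaluators judged.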

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20744/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20744/full.md

---
Source: https://tomesphere.com/paper/2508.20744