Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite

William H English; Chase Walker; Dominic Simon; Sumit Kumar Jha; Rickard Ewetz

arXiv:2507.00877·eess.SY·December 19, 2025

Verifiable Natural Language to Linear Temporal Logic Translation: A Benchmark Dataset and Evaluation Suite

William H English, Chase Walker, Dominic Simon, Sumit Kumar Jha, Rickard Ewetz

PDF

Open Access 3 Reviews

TL;DR

This paper introduces VLTL-Bench, a comprehensive benchmark dataset for evaluating the accuracy and verifiability of natural language to linear temporal logic translation systems across diverse scenarios.

Contribution

It provides a unified, extensible benchmark with multiple state spaces, sample traces, and ground truths for each translation step, addressing limitations of existing datasets.

Findings

01

Benchmark reveals current systems' limitations in grounding and verification.

02

Provides ground truths for each translation step to facilitate targeted improvements.

03

Supports evaluation of end-to-end and substep translation processes.

Abstract

Empirical evaluation of state-of-the-art natural-language (NL) to temporal-logic (TL) translation systems reveals near-perfect performance on existing benchmarks. However, current studies measure only the accuracy of the translation of NL logic into formal TL, ignoring a system's capacity to ground atomic propositions into new scenarios or environments. This is a critical feature, necessary for the verification of resulting formulas in a concrete state space. Consequently, most NL-to-TL translation frameworks propose their own bespoke dataset in which the correct grounding is known a-priori, inflating performance metrics and neglecting the need for extensible, domain-general systems. In this paper, we introduce the Verifiable Linear Temporal Logic Benchmark ( VLTL-Bench), a unifying benchmark that measures verification and verifiability of automated NL-to-LTL translation. The dataset…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 4Confidence 3

Strengths

A) The paper correctly identifies a critical gap in existing benchmarks that they focus on lifted translation while ignoring grounding, which is essential for executable specifications. The observation that current datasets achieve >90% accuracy but cannot produce verifiable formulas is valuable. B) The design allowing isolated assessment of lifting, grounding, translation, and verification is well-motivated and technically sound. The metrics are clearly defined and appropriate.

Weaknesses

A) Limited Template Diversity and Generalization This is my biggest concern with this work. The benchmark is built from only 43 linguistic and logical templates (36 from nl2spec + 7 new). While thousands of examples are generated through instantiation with different actions/arguments, this approach has some limitations. 43 templates cannot capture the rich diversity of natural language specifications in real-world applications. Human stakeholders express requirements in countless syntactic stru

Reviewer 02Rating 6Confidence 2

Strengths

1. VLTL-Bench measures all four aspects and supplies ground truth for each, plus example traces to check whether formulas actually hold. 2. The scenario configs and templated generation make the benchmark easy to extend.

Weaknesses

1. All LLMs used for evaluation all extremely small & non-reasoning LLMs. The task is considered as a reasoning task, so including results for reasoning LLMs will make the evaluation more comprehensive. 2. The benchmark centers on discrete-time LTL. Related logics are discussed in L680 but not supported, limiting applicability to systems needing other temporal formalisms.

Reviewer 03Rating 6Confidence 4

Strengths

The great detection of the flaws in current approaches and the solid dataset creation and benchmark testing. The writing and related work articulation are clear. The motivation is great.

Weaknesses

The main question I am doubting is the semantic and expression diversity of the created dataset. The initial 43 expressions seem too limited. Meanwhile, I remember in the referenced work NL2TL, they utilized LLM to help synthesize the initial pairs to then do human annotation. That increases the diversity. In this study, it seems the authors do not utilize LLM for synthesizing. I wonder if the authors can explain what's the reason and possible benefits.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsFormal Methods in Verification · Natural Language Processing Techniques · Topic Modeling