T2S-Metrics: Unified Library for Evaluating SPARQL Queries Generated From Natural Language

Yousouf Taghzouti (ICN; WIMMICS; Laboratoire I3S - SPARKS); Tao Jiang (ICN); Camille Juigné (WIMMICS; Laboratoire I3S - SPARKS); Benjamin Navet (ICN; WIMMICS; Laboratoire I3S - SPARKS); Fabien Gandon (WIMMICS; Laboratoire I3S - SPARKS); Franck Michel (Laboratoire I3S - SPARKS; WIMMICS); Louis-Felix Nothias (ICN)

arXiv:2604.26971·cs.IR·May 1, 2026

TL;DR

t2s-metrics is an open-source, comprehensive library that standardizes and extends the evaluation of SPARQL query generation and execution in QA systems over Knowledge Graphs.

Contribution

t2s-metrics introduces a unified, extensible evaluation framework with over 20 metrics covering lexical, syntactic, semantic, structural, and ranking aspects, with the aim of improving reproducibility and diagnostic insight.

Findings

01 Provides a broad set of evaluation metrics drawn from the literature and from practical evaluation needs.

02 Enables consistent, transparent, and reproducible assessment of SPARQL-based QA systems.

03 Facilitates deeper analysis beyond simple answer correctness.

Abstract

The evaluation of Question Answering (QA) systems over Knowledge Graphs has historically suffered from fragmentation, inconsistency, and limited reproducibility. While significant progress has been made in semantic parsing and SPARQL query generation, evaluation methodologies remain diverse, ad hoc, and often incomparable across studies. Existing benchmarks typically focus on a small subset of metrics, such as query exact match or answer-level F1, neglecting syntactic validity, semantic faithfulness, execution correctness, results ranking quality, and computational efficiency. In this paper, we present t2s-metrics, an open-source, extensible, and unified evaluation library designed specifically for SPARQL query comparison and execution-based assessment. t2s-metrics provides a broad and extensible set of over 20 evaluation metrics, collected from the literature and practical evaluation…
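To make the two metrics named in the abstract concrete, here is a minimal Python sketch of query exact match and answer-level F1. It is an illustration only and does not reproduce the t2s-metrics API: the function names `normalize_whitespace`, `query_exact_match`, and `answer_f1` are hypothetical, the whitespace normalization is a deliberately crude assumption, and treating two empty answer sets as a perfect match is one convention among several.

```python
from typing import Hashable, Set, Tuple


def normalize_whitespace(query: str) -> str:
    """Collapse runs of whitespace (crude; it also alters whitespace inside string literals)."""
    return " ".join(query.split())


def query_exact_match(predicted: str, reference: str) -> bool:
    """Return True when the two query strings are identical after whitespace normalization."""
    return normalize_whitespace(predicted) == normalize_whitespace(reference)


def answer_f1(predicted: Set[Hashable], gold: Set[Hashable]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) computed over two sets of answers."""
    if not predicted and not gold:
        # Assumed convention: two empty result sets count as a perfect match.
        return 1.0, 1.0, 1.0
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    reference = "SELECT ?city WHERE { ?city a :City }"
    predicted = "SELECT ?x WHERE { ?x a :City }"  # same meaning, different variable name
    print(query_exact_match(predicted, reference))
    print(answer_f1({"Paris", "Nice"}, {"Paris", "Nice"}))
```

In the usage example, the two queries differ only in a variable name, so string-level exact match rejects the prediction even though both queries would return the same answers. This is the kind of gap between query-level and answer-level scores that motivates evaluating along more than one dimension.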
