The Wikidata Query Logs Dataset

Sebastian Walter; Hannah Bast

arXiv:2602.14594 · cs.CL · May 20, 2026

TL;DR

The paper introduces the Wikidata Query Logs (WDQL) dataset: 335,000 question-query pairs built from real-world SPARQL queries sent to the Wikidata Query Service, intended as training data for question-answering systems.

Contribution

It presents a large-scale dataset constructed from actual query logs rather than template-generated queries, together with an agent-based method that de-anonymizes, cleans, and verifies the logged queries. No Wikidata dataset of this format and size was previously available.

Findings

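01 The dataset is over 11 times larger than the largest existing Wikidata datasets of similar format.

02 The agent-based method iteratively de-anonymizes, cleans, and verifies queries against Wikidata while generating matching natural-language questions.

03 Training on the dataset benefits question-answering methods.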

Abstract

We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. It is over 11x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the benefit of this dataset for training question-answering methods. All WDQL assets, as well as the agent code, are…
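The iterative loop described in the abstract (rewrite a query, run it, check whether it returns results, then attach a question) can be illustrated with a minimal Python sketch. This is not the authors' agent code: the names `refine_query`, `propose`, `execute`, and `describe`, the round limit, and the `PLACEHOLDER` token standing in for an anonymized value are all hypothetical, and the toy callbacks below replace what would in practice be a language model and a live SPARQL endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Rows = List[Dict[str, str]]


@dataclass
class Pair:
    """One question-query pair (hypothetical record layout)."""
    question: str
    sparql: str


def refine_query(
    anonymized: str,
    propose: Callable[[str, List[str]], str],
    execute: Callable[[str], Rows],
    describe: Callable[[str, Rows], str],
    max_rounds: int = 5,
) -> Optional[Pair]:
    """Rewrite a query until it returns results, then attach a question.

    Returns None if no candidate produces results within max_rounds.
    """
    query = anonymized
    feedback: List[str] = []
    for _ in range(max_rounds):
        rows = execute(query)  # verification step: does the query yield results?
        if rows:
            return Pair(question=describe(query, rows), sparql=query)
        feedback.append(f"no results for: {query}")
        query = propose(query, feedback)  # ask for a repaired candidate
    return None


# Toy stand-ins so the sketch runs offline (assumptions, not real services).
def toy_execute(query: str) -> Rows:
    # Pretend that any query still containing the placeholder returns nothing.
    return [] if "PLACEHOLDER" in query else [{"item": "Q1"}]


def toy_propose(query: str, feedback: List[str]) -> str:
    # Pretend the agent guesses a concrete value for the placeholder.
    return query.replace("PLACEHOLDER", "example label")


def toy_describe(query: str, rows: Rows) -> str:
    return "Which item has the label 'example label'?"
```

Calling `refine_query('SELECT ?x WHERE { ?x rdfs:label "PLACEHOLDER" }', toy_propose, toy_execute, toy_describe)` returns a `Pair` after one repair round. Keeping the three steps as injected callbacks is only a design convenience for this sketch: it lets the loop be tested without network access.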

Code & Models

Repositories

ad-freiburg/wikidata-query-logs (GitHub)

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Semantic Web and Ontologies