TL;DR
The paper introduces the Wikidata Query Logs (WDQL) dataset: 335,000 question-query pairs built from real-world SPARQL queries sent to the Wikidata Query Service, intended for training question-answering systems.
Contribution
It presents a large-scale dataset constructed from actual query logs, a resource not previously available at this scale, together with an agent-based method for de-anonymizing, cleaning, and verifying the logged queries.
Findings
The dataset is over 11 times larger than the largest existing Wikidata datasets of similar format, without relying on template-generated queries.
The agent-based method effectively de-anonymizes and verifies queries.
The dataset improves training for question-answering methods.
Abstract
We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 335k question-query pairs over the Wikidata knowledge graph. It is over 11x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the benefit of this dataset for training question-answering methods. All WDQL assets, as well as the agent code, are…
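The iterative repair-and-verify idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' agent: `repair_query`, `run_query`, `propose_fix`, and `max_rounds` are hypothetical names introduced here, and the `"string1"` placeholder is only an assumed example of what an anonymized literal might look like. The stubs below run offline; a real setup would presumably call a SPARQL endpoint and a language-model agent instead.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces (assumptions, not taken from the paper):
# a runner returns result rows; a fixer proposes a revised query or None.
QueryRunner = Callable[[str], List[dict]]
QueryFixer = Callable[[str, int], Optional[str]]


def repair_query(
    query: str,
    run_query: QueryRunner,
    propose_fix: QueryFixer,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate query that yields results, else None."""
    candidate = query
    for round_index in range(max_rounds + 1):
        try:
            rows = run_query(candidate)
        except Exception:
            rows = []  # treat an execution error like an empty result
        if rows:
            return candidate  # verified: the query produces results
        if round_index == max_rounds:
            break  # repair budget exhausted
        next_candidate = propose_fix(candidate, round_index)
        if next_candidate is None:
            return None  # the fixer gave up
        candidate = next_candidate
    return None


# Offline stand-ins for the endpoint and the agent (illustrative only).
def stub_run_query(query: str) -> List[dict]:
    # Pretend only the de-anonymized query matches anything.
    return [{"item": "Q64"}] if '"Berlin"@en' in query else []


def stub_propose_fix(query: str, round_index: int) -> Optional[str]:
    # Replace an assumed anonymization placeholder with a concrete literal.
    fixed = query.replace('"string1"', '"Berlin"@en')
    return fixed if fixed != query else None


anonymized = 'SELECT ?item WHERE { ?item rdfs:label "string1" }'
repaired = repair_query(anonymized, stub_run_query, stub_propose_fix)
```

Passing the runner and fixer as functions keeps the loop testable without network access; the verification criterion used here (a non-empty result set) is a simplification of whatever checks the actual method applies.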
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Semantic Web and Ontologies
