Sanitizing Synthetic Training Data Generation for Question Answering over Knowledge Graphs

Trond Linjordet; Krisztian Balog

arXiv:2009.04915·cs.IR·September 11, 2020

TL;DR

This paper examines how template-based synthetic data generation for question answering over knowledge graphs can cause information leakage across data splits, affecting model performance, and proposes a sanitized partitioning method to improve generalization.

Contribution

The paper measures the extent of information leakage in template-based synthetic data and introduces a sanitized partitioning approach that controls this leakage across training, validation and test splits.

Findings

01

Information leakage occurs across data splits due to shared templates.

02

Sanitized partitioning controls this leakage, which otherwise affects model performance.

03

Models trained under the sanitized partitioning still generalize to test data.

Abstract

Synthetic data generation is important to training and evaluating neural models for question answering over knowledge graphs. The quality of the data and the partitioning of the datasets into training, validation and test splits impact the performance of the models trained on this data. If the synthetic data generation depends on templates, as is the predominant approach for this task, there may be a leakage of information via a shared basis of templates across data splits if the partitioning is not performed hygienically. This paper investigates the extent of such information leakage across data splits, and the ability of trained models to generalize to test data when the leakage is controlled. We find that information leakage indeed occurs and that it affects performance. At the same time, the trained models do generalize to test data under the sanitized partitioning presented here.…
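
The idea of partitioning hygienically can be illustrated with a minimal sketch that assigns whole templates, rather than individual questions, to each split. This is not the authors' procedure, which the abstract does not specify; the field names `template_id` and `question`, the helper names, and the 80/10/10 ratios are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_template(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign entire templates to train/valid/test (hypothetical helper).

    Every example generated from a given template lands in exactly one
    split, so no template is shared across splits.
    """
    # Group the generated examples by the template that produced them.
    by_template = defaultdict(list)
    for ex in examples:
        by_template[ex["template_id"]].append(ex)

    # Shuffle template ids (not examples) with a fixed seed.
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)

    n = len(template_ids)
    n_train = int(ratios[0] * n)
    n_valid = int(ratios[1] * n)
    groups = {
        "train": template_ids[:n_train],
        "valid": template_ids[n_train:n_train + n_valid],
        "test": template_ids[n_train + n_valid:],
    }
    return {
        name: [ex for t in ids for ex in by_template[t]]
        for name, ids in groups.items()
    }


def shared_templates(split_a, split_b):
    """Return the template ids that occur in both splits (leakage check)."""
    ids_a = {ex["template_id"] for ex in split_a}
    ids_b = {ex["template_id"] for ex in split_b}
    return ids_a & ids_b


# Toy data: 10 hypothetical templates, 5 generated questions each.
examples = [
    {"template_id": t, "question": f"question {i} from template {t}"}
    for t in range(10)
    for i in range(5)
]
splits = split_by_template(examples)
```

By contrast, shuffling and splitting individual examples would almost always place questions from the same template in more than one split, which is the shared-template situation the abstract describes. Calling `shared_templates` on any two splits gives a quick check for it.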
