Sanitizing Synthetic Training Data Generation for Question Answering over Knowledge Graphs
Trond Linjordet, Krisztian Balog

TL;DR
This paper examines how template-based synthetic data generation for question answering over knowledge graphs can cause information leakage across data splits, affecting model performance, and proposes a sanitized partitioning method to improve generalization.
Contribution
It investigates the extent of data leakage in template-based synthetic data and introduces a sanitized partitioning approach to mitigate leakage and enhance model generalization.
Findings
Information leakage occurs across data splits due to shared templates.
This leakage affects measured model performance.
Under the sanitized partitioning, trained models still generalize to test data.
Abstract
Synthetic data generation is important to training and evaluating neural models for question answering over knowledge graphs. The quality of the data and the partitioning of the datasets into training, validation and test splits impact the performance of the models trained on this data. If the synthetic data generation depends on templates, as is the predominant approach for this task, there may be a leakage of information via a shared basis of templates across data splits if the partitioning is not performed hygienically. This paper investigates the extent of such information leakage across data splits, and the ability of trained models to generalize to test data when the leakage is controlled. We find that information leakage indeed occurs and that it affects performance. At the same time, the trained models do generalize to test data under the sanitized partitioning presented here.…
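The abstract attributes leakage to templates shared across splits. One simple way to control for that is to assign whole templates, rather than individual examples, to a split. The sketch below illustrates that idea only; it is not taken from the paper, whose actual procedure may differ. The `template_id` field, the function names `split_by_template` and `shared_templates`, and the 80/10/10 ratios are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_template(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole templates (not single examples) to train/valid/test.

    Assumes each example is a dict with a hypothetical "template_id" key
    naming the template it was generated from.
    """
    # Group examples by the template that produced them.
    by_template = defaultdict(list)
    for ex in examples:
        by_template[ex["template_id"]].append(ex)

    # Shuffle template ids deterministically, then cut the list by ratio.
    template_ids = sorted(by_template)
    random.Random(seed).shuffle(template_ids)
    n_train = int(len(template_ids) * ratios[0])
    n_valid = int(len(template_ids) * ratios[1])
    groups = {
        "train": template_ids[:n_train],
        "valid": template_ids[n_train:n_train + n_valid],
        "test": template_ids[n_train + n_valid:],
    }

    # Every example follows its template into exactly one split.
    return {
        name: [ex for t in ids for ex in by_template[t]]
        for name, ids in groups.items()
    }


def shared_templates(split_a, split_b):
    """Return the template ids that appear in both splits (the leak)."""
    ids_a = {ex["template_id"] for ex in split_a}
    ids_b = {ex["template_id"] for ex in split_b}
    return ids_a & ids_b


if __name__ == "__main__":
    # Toy data: 10 hypothetical templates, 3 generated questions each.
    data = [
        {"template_id": t, "question": f"question {i} from template {t}"}
        for t in range(10)
        for i in range(3)
    ]
    splits = split_by_template(data)
    print({name: len(items) for name, items in splits.items()})
    print(shared_templates(splits["train"], splits["test"]))
```

Running `shared_templates` on a conventional example-level random split of the same data would typically return a non-empty set, which is the kind of overlap the abstract describes as leakage.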
