BioGraphletQA: Knowledge-Anchored Generation of Complex QA Datasets

Richard A. A. Jonker; Bárbara Maria Ribeiro de Abreu Martins; Sérgio Matos

arXiv:2604.26048·cs.CL·April 30, 2026

TL;DR

This paper introduces a scalable framework for generating complex QA datasets grounded in knowledge graphs. Its first instantiation is BioGraphletQA, a biomedical dataset of 119,856 QA pairs, and the authors report that augmenting existing benchmarks with it improves QA accuracy.

Contribution

The paper presents a graphlet-anchored generation framework for creating factual, complex QA data, together with a new biomedical KGQA dataset and publicly available resources.

Findings

01 A domain expert's evaluation of 106 QA pairs confirms the high scientific validity and complexity of the generated data.

02 Augmenting benchmarks with the dataset improves QA accuracy significantly.

03 The framework generalizes to various complex QA tasks.

Abstract

This paper presents a principled and scalable framework for systematically generating complex Question Answering (QA) data. At the core of this framework is a graphlet-anchored generation process, in which small subgraphs from a Knowledge Graph (KG) are used in a structured prompt to control the complexity and ensure the factual grounding of questions generated by Large Language Models. The first instantiation of this framework is BioGraphletQA, a new biomedical KGQA dataset of 119,856 QA pairs. Each entry is grounded in a graphlet of up to five nodes from the OREGANO KG, and most pairs are enriched with relevant document snippets from PubMed. First, we demonstrate the framework's value and the dataset's quality through evaluation by a domain expert on 106 QA pairs, confirming the high scientific validity and complexity of the generated data. Second, we establish its…
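The graphlet-anchored idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy triples, the function names (`sample_graphlet`, `build_prompt`), the random-expansion sampling strategy, and the prompt wording are all assumptions made for illustration. The sketch grows a connected subgraph of at most five nodes and lists its edges as facts in a structured prompt that could be passed to a language model.

```python
import random
from collections import defaultdict

# Toy (head, relation, tail) triples, invented for illustration only;
# they are NOT taken from the OREGANO KG.
TRIPLES = [
    ("drug_A", "targets", "protein_X"),
    ("protein_X", "participates_in", "pathway_P"),
    ("pathway_P", "associated_with", "disease_D"),
    ("drug_B", "targets", "protein_X"),
    ("drug_A", "causes", "side_effect_S"),
]


def build_adjacency(triples):
    """Map every node to the triples it takes part in (as head or tail)."""
    adj = defaultdict(list)
    for h, r, t in triples:
        adj[h].append((h, r, t))
        adj[t].append((h, r, t))
    return adj


def sample_graphlet(triples, max_nodes=5, seed=0):
    """Grow a connected subgraph of at most `max_nodes` nodes.

    Hypothetical strategy: start at a random node and repeatedly add a
    random edge that has exactly one endpoint inside the current node set.
    Only the traversed edges are kept, so the result is a tree rather than
    a full induced subgraph.
    """
    rng = random.Random(seed)
    adj = build_adjacency(triples)
    start = rng.choice(sorted(adj))
    nodes, edges = {start}, []
    while len(nodes) < max_nodes:
        frontier = [
            e
            for n in sorted(nodes)
            for e in adj[n]
            if (e[0] in nodes) != (e[2] in nodes)
        ]
        if not frontier:
            break  # the connected component is exhausted
        h, r, t = rng.choice(frontier)
        edges.append((h, r, t))
        nodes.update((h, t))
    return nodes, edges


def build_prompt(edges):
    """Format the graphlet as a structured prompt (wording is an assumption)."""
    facts = "\n".join(f"- {h} --{r}--> {t}" for h, r, t in edges)
    return (
        "Use only the facts below to write one question that requires "
        "combining several of them, followed by its answer.\n"
        f"Facts:\n{facts}\n"
        "Question:"
    )


if __name__ == "__main__":
    _, sampled_edges = sample_graphlet(TRIPLES, max_nodes=5, seed=0)
    print(build_prompt(sampled_edges))
```

In this sketch, the node cap bounds how many facts a question can combine, and restricting the prompt to the sampled edges is what ties the generated question to the KG.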

Code & Models

Repositories

ieeta-pt/BioGraphletQA (GitHub)
