Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks

Sizhe Yuen; Ting Su; Ziyang Wang; Yali Du; Adam J. Sobey

arXiv:2505.14212 · cs.CL · May 21, 2025

Open Access

TL;DR

This paper introduces an automated dataset generation approach for knowledge-intensive question answering, using LLMs to create training data that enhances reasoning and factual accuracy in QA systems.

Contribution

It proposes a novel automated QA pair generation method leveraging LLMs, reducing human labeling and improving model reasoning capabilities.

Findings

01 Mistral-7b-v0.3 outperforms Llama-3-8b in QA tasks.

02 Generated QA pairs achieve higher BERT F1, BLEU, and ROUGE scores.

03 Automated data improves logical coherence and factual accuracy.

Abstract

A question-answering (QA) system searches for suitable answers within a knowledge base. Current QA systems struggle with queries requiring complex reasoning or real-time knowledge integration. They are often supplemented with retrieval techniques over a data source, such as Retrieval-Augmented Generation (RAG). However, RAG continues to face challenges in handling complex reasoning and logical connections between multiple sources of information. A novel approach for enhancing Large Language Models (LLMs) in knowledge-intensive QA tasks is presented through the automated generation of context-based QA pairs. This methodology leverages LLMs to create fine-tuning data, reducing reliance on human labelling and improving model comprehension and reasoning capabilities. The proposed system includes an automated QA generator and a model fine-tuner, evaluated using perplexity, ROUGE, BLEU, and…
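
As a rough illustration of two of the metrics named in the abstract, the sketch below computes a simplified ROUGE-1 F1 (unigram overlap) and perplexity from per-token log-probabilities. It is not the paper's evaluation code: the function names `rouge1_f1` and `perplexity` are hypothetical, tokenization is plain lowercased whitespace splitting, and standard ROUGE implementations typically add stemming and more careful tokenization.

```python
import math
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between two answers.

    Assumption: lowercased whitespace tokens, no stemming.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Clipped overlap: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity as exp of the negative mean natural-log probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

A generated answer could be scored against a reference answer with `rouge1_f1(generated, reference)`, while `perplexity` would take the log-probabilities a language model assigns to each token of a held-out answer; lower perplexity means the model finds the text less surprising.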

Taxonomy

Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Educational Technology and Assessment

Methods: Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay