Automatic Dataset Generation for Knowledge Intensive Question Answering Tasks
Sizhe Yuen, Ting Su, Ziyang Wang, Yali Du, Adam J. Sobey

TL;DR
This paper introduces an automated dataset generation approach for knowledge-intensive question answering, using LLMs to create training data that enhances reasoning and factual accuracy in QA systems.
Contribution
It proposes a novel automated QA pair generation method leveraging LLMs, reducing human labeling and improving model reasoning capabilities.
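As a rough illustration of what automated QA pair generation can look like, the sketch below splits a document into chunks, prompts a model once per chunk, and parses the reply into a question and an answer. This is not the authors' implementation: the prompt wording, the `Q:`/`A:` reply format, the chunk size, and the `llm` callable (any function that maps a prompt string to a reply string) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical prompt; the paper's actual prompt is not given in this summary.
PROMPT_TEMPLATE = (
    "Read the context and write one question that can be answered from it, "
    "followed by the answer.\n"
    "Context: {context}\n"
    "Format:\nQ: <question>\nA: <answer>"
)


def chunk_text(text: str, max_words: int = 50) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def parse_qa(reply: str) -> Optional[Tuple[str, str]]:
    """Extract (question, answer) from a reply using the assumed Q:/A: format."""
    question = answer = None
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:"):
            answer = line[2:].strip()
    if question and answer:
        return question, answer
    return None  # malformed reply; the caller skips it


def generate_qa_pairs(
    text: str, llm: Callable[[str], str], max_words: int = 50
) -> List[Dict[str, str]]:
    """Build context-based QA records, one per chunk with a parseable reply."""
    pairs = []
    for chunk in chunk_text(text, max_words):
        parsed = parse_qa(llm(PROMPT_TEMPLATE.format(context=chunk)))
        if parsed is not None:
            question, answer = parsed
            pairs.append({"context": chunk, "question": question, "answer": answer})
    return pairs
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM library. The resulting records could then be written out as fine-tuning data.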
Findings
Mistral-7b-v0.3 outperforms Llama-3-8b in QA tasks.
Generated QA pairs achieve higher BERT F1, BLEU, and ROUGE scores.
Automated data improves logical coherence and factual accuracy.
Abstract
A question-answering (QA) system searches for suitable answers within a knowledge base. Current QA systems struggle with queries requiring complex reasoning or real-time knowledge integration. They are often supplemented with retrieval techniques over a data source, such as Retrieval-Augmented Generation (RAG). However, RAG continues to face challenges in handling complex reasoning and logical connections between multiple sources of information. This paper presents a novel approach for enhancing Large Language Models (LLMs) in knowledge-intensive QA tasks through the automated generation of context-based QA pairs. The methodology leverages LLMs to create fine-tuning data, reducing reliance on human labelling and improving model comprehension and reasoning capabilities. The proposed system includes an automated QA generator and a model fine-tuner, evaluated using perplexity, ROUGE, BLEU, and…
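To make two of the named metrics concrete, here is a minimal sketch of perplexity and a ROUGE-1-style F1 score in plain Python. It is illustrative only and not the paper's evaluation code: it assumes per-token natural-log probabilities are already available from the model, and it uses lowercase whitespace tokenization, which is simpler than what standard ROUGE packages do.

```python
import math
from collections import Counter
from typing import List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp of the negative mean per-token log-probability."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate answer and a reference answer."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each word's count to the smaller of the two.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Lower perplexity means the model assigns higher probability to the reference text, while a higher ROUGE-1 F1 means more word overlap between the generated and reference answers.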
Taxonomy
Topics
Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Educational Technology and Assessment
Methods
Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay
