ScIRGen: Synthesize Realistic and Large-Scale RAG Dataset for Scientific Research
Junyong Lin, Lu Dai, Ruiqian Han, Yijie Sui, Ruilin Wang, Xingliang Sun, Qinglin Wu, Min Feng, Hao Liu, Hui Xiong

TL;DR
This paper introduces ScIRGen, a framework for generating large-scale, realistic scientific QA and retrieval datasets using novel data augmentation and question generation techniques, so that the resulting data better reflects researchers' real information needs.
Contribution
We developed a new dataset generation framework, ScIRGen, that produces realistic scientific questions and answers, filling the gap in existing datasets and enabling better evaluation of scientific retrieval and QA methods.
Findings
Current methods struggle with reasoning on complex scientific questions.
The ScIRGen-Geo dataset contains 61,000 QA pairs.
Benchmarking shows room for improvement in scientific QA and retrieval.
Abstract
Scientific researchers need intensive information about datasets to effectively evaluate and develop theories and methodologies. The information needs regarding datasets are implicitly embedded in particular research tasks, rather than explicitly expressed in search queries. However, existing scientific retrieval and question-answering (QA) datasets typically address straightforward questions, which do not align with the distribution of real-world research inquiries. To bridge this gap, we developed ScIRGen, a dataset generation framework for scientific QA & retrieval that more accurately reflects the information needs of professional science researchers, and used it to create a large-scale scientific retrieval-augmented generation (RAG) dataset with realistic queries, datasets and papers. Technically, we designed a dataset-oriented information extraction method that leverages academic…
Taxonomy
Topics: Machine Learning in Healthcare · Artificial Intelligence in Healthcare · Traditional Chinese Medicine Studies
