ChemRxivQuest: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv Preprints
Mahmoud Amiri, Thomas Bocklitz

TL;DR
ChemRxivQuest is a curated dataset of 970 high-quality chemistry question-answer pairs extracted from ChemRxiv preprints, designed to advance NLP applications in chemistry through structured, traceable, and domain-specific data.
Contribution
This paper introduces ChemRxivQuest, a chemistry QA dataset built with an automated pipeline that combines OCR, GPT-4o-based QA generation, and fuzzy matching for answer verification, intended to support NLP research and applications in chemistry.
Findings
Dataset covers 17 chemistry subfields.
QA pairs linked to source text for traceability.
Emphasizes conceptual and mechanistic questions.
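As a rough illustration of what a traceable QA pair could look like in code, the sketch below models one record and counts records per subfield. The field names (`source_segment`, `preprint_id`, `subfield`) and the sample values are hypothetical placeholders, not the dataset's published schema or contents.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QARecord:
    """One QA pair plus the link back to its source (hypothetical schema)."""
    question: str
    answer: str
    source_segment: str  # text span the answer was drawn from
    preprint_id: str     # placeholder identifier for the source preprint
    subfield: str        # one of the chemistry subfields


def count_by_subfield(records):
    """Return how many QA records fall into each subfield."""
    return Counter(record.subfield for record in records)


# Placeholder records for illustration only; not taken from the dataset.
records = [
    QARecord("Q1?", "A1", "segment containing A1", "preprint-001", "Catalysis"),
    QARecord("Q2?", "A2", "segment containing A2", "preprint-001", "Catalysis"),
    QARecord("Q3?", "A3", "segment containing A3", "preprint-002", "Polymer Science"),
]
counts = count_by_subfield(records)
```

Keeping the source segment on the record itself means a reader can check any answer against its supporting text without going back to the preprint.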
Abstract
The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific knowledge. To support advancements in chemistry-focused natural language processing (NLP), we present ChemRxivQuest, a curated dataset of 970 high-quality question-answer (QA) pairs derived from 155 ChemRxiv preprints across 17 subfields of chemistry. Each QA pair is explicitly linked to its source text segment to ensure traceability and contextual accuracy. ChemRxivQuest was constructed using an automated pipeline that combines optical character recognition (OCR), GPT-4o-based QA generation, and a fuzzy matching technique for answer verification. The dataset emphasizes conceptual, mechanistic, applied, and experimental questions, enabling applications in retrieval-based QA systems, search engine development, and fine-tuning of domain-adapted large…
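The abstract does not say how the fuzzy matching step is implemented. The sketch below shows one plausible form of such a check, using Python's standard-library `difflib.SequenceMatcher` as a stand-in: it slides a word window over the source segment and accepts the answer if some window is similar enough. The function names and the 0.8 threshold are assumptions chosen for illustration, not values from the paper.

```python
from difflib import SequenceMatcher


def best_window_ratio(answer: str, source: str) -> float:
    """Best similarity between the answer and any equally long word window of the source."""
    answer_words = answer.lower().split()
    source_words = source.lower().split()
    if not answer_words or not source_words:
        return 0.0
    # Window size is the answer length, capped at the source length.
    size = min(len(answer_words), len(source_words))
    target = " ".join(answer_words)
    best = 0.0
    for start in range(len(source_words) - size + 1):
        window = " ".join(source_words[start:start + size])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def is_supported(answer: str, source: str, threshold: float = 0.8) -> bool:
    """Treat the answer as verified if it fuzzily matches part of the source.

    The threshold is an assumed value, not one reported by the authors.
    """
    return best_window_ratio(answer, source) >= threshold


source = ("The catalyst lowers the activation energy of the reaction "
          "by stabilizing the transition state.")
supported = is_supported("lowers the activation energy of the reaction", source)
unsupported = is_supported("the polymer was dissolved in toluene at room temperature", source)
```

A check of this kind tolerates small OCR or paraphrasing differences while still rejecting answers that have no close counterpart in the source text.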
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Advanced Text Analysis Techniques · Computational Drug Discovery Methods
