ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering

Xiuying Chen; Tairan Wang; Taicheng Guo; Kehan Guo; Juexiao Zhou; Haoyang Li; Mingchen Zhuge; Jürgen Schmidhuber; Xin Gao; Xiangliang Zhang

arXiv:2407.16931·cs.CL·July 25, 2024

TL;DR

ScholarChemQA introduces a large-scale chemical question answering dataset and a specialized model that leverages data augmentation, re-weighting, and calibration to improve reasoning and understanding in chemical research questions.

Contribution

The paper presents ScholarChemQA, a novel chemical QA dataset, and QAMatch, a tailored model that addresses data imbalance and unlabeled data challenges in chemical question answering.

Findings

01

QAMatch outperforms recent baselines and LLMs on ScholarChemQA.

02

The dataset reflects real-world chemical research challenges.

03

Data augmentation and re-weighting improve model performance.

Abstract

Question Answering (QA) effectively evaluates language models' reasoning and knowledge depth. While QA datasets are plentiful in areas such as the general domain and biomedicine, academic chemistry is less explored. Chemical QA plays a crucial role in both education and research by translating complex chemical information into a readily understandable format. Addressing this gap, we introduce ScholarChemQA, a large-scale QA dataset constructed from chemical papers. This dataset reflects typical real-world challenges, including an imbalanced data distribution and a substantial amount of unlabeled data that is potentially useful. Correspondingly, we introduce QAMatch, a model specifically designed to answer chemical questions by fully leveraging our collected data. We first address the issue of imbalanced label distribution by re-weighting the instance-wise loss based…
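The abstract is cut off before it states what the instance-wise loss is re-weighted by, so the paper's actual scheme is not reproduced here. As a rough illustration of the general idea only, the sketch below uses inverse class frequency, a common choice for imbalanced labels and an assumption on our part. The function names and the label strings in the usage note are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to n / (k * count): rarer classes get larger weights.

    Assumption: inverse-frequency weighting stands in for the paper's
    re-weighting scheme, which is truncated in the abstract.
    """
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of per-instance negative log-likelihoods.

    probs  : list of dicts mapping class -> predicted probability
    labels : list of gold classes, aligned with probs
    weights: dict mapping class -> loss weight
    """
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        total += -w * math.log(p[y])  # instance-wise loss scaled by class weight
        weight_sum += w
    return total / weight_sum  # normalize so the loss scale stays comparable
```

For example, with six instances of a hypothetical class "a", three of "b", and one of "c", the weights come out largest for "c" and smallest for "a", so errors on the rare class contribute more to the loss than errors on the frequent one.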

Code & Models

Datasets

Kylan12/mycotoxin-chemical-research-sythetic-reasoning
dataset · 45 downloads

Taxonomy

Methods: ALIGN