When Active Learning Falls Short: An Empirical Study on Chemical Reaction Extraction
Simin Yu, Sufia Fathima

TL;DR
This study systematically evaluates active learning strategies for chemical reaction extraction, revealing challenges and insights for improving data efficiency in chemical information tasks.
Contribution
The paper analyzes six uncertainty- and diversity-based active learning strategies integrated with pretrained transformer-CRF models for chemical reaction extraction, highlighting task-dependent behaviors and limitations.
Findings
Some methods approach full-data performance with fewer labels
Learning curves are often non-monotonic and task-dependent
Pretraining, CRF decoding, and label sparsity affect active learning stability
Abstract
The rapid growth of chemical literature has generated vast amounts of unstructured data, where reaction information is particularly valuable for applications such as reaction prediction and drug design. However, the prohibitive cost of expert annotation has led to a scarcity of training data, severely hindering the performance of automatic reaction extraction. In this work, we conduct a systematic study of active learning for chemical reaction extraction. We integrate six uncertainty- and diversity-based strategies with pretrained transformer-CRF architectures, and evaluate them on product extraction and role labeling tasks. While several methods approach full-data performance with fewer labeled instances, learning curves are often non-monotonic and task-dependent. Our analysis shows that strong pretraining, structured CRF decoding, and label sparsity limit the stability of conventional…
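As a rough illustration of what an uncertainty-based selection step can look like in a sequence-labeling setting, the sketch below ranks unlabeled sentences by two common scores (least confidence and mean token entropy) and picks the top k for annotation. It is a generic example, not the paper's implementation: the six strategies are not named in this summary, and the function names, the toy `pool`, and the assumption that per-token label probabilities are already available (for a CRF these would typically be token marginals) are all hypothetical.

```python
import math


def least_confidence(token_probs):
    """Score a sentence as 1 minus the mean of each token's top label probability.

    token_probs: list of per-token probability lists (assumed given by the model).
    Higher scores mean the model is less confident.
    """
    if not token_probs:
        return 0.0
    return 1.0 - sum(max(p) for p in token_probs) / len(token_probs)


def token_entropy(token_probs):
    """Score a sentence as the mean Shannon entropy of its per-token distributions."""
    if not token_probs:
        return 0.0
    total = 0.0
    for p in token_probs:
        total += -sum(q * math.log(q) for q in p if q > 0)
    return total / len(token_probs)


def select_batch(pool, score_fn, k):
    """Return the ids of the k highest-scoring (most uncertain) sentences."""
    ranked = sorted(pool, key=lambda sid: score_fn(pool[sid]), reverse=True)
    return ranked[:k]


# Hypothetical unlabeled pool: sentence id -> per-token label probabilities.
pool = {
    "s1": [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]],  # confident
    "s2": [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]],  # very uncertain
    "s3": [[0.90, 0.05, 0.05], [0.60, 0.30, 0.10]],  # mixed
}

picked = select_batch(pool, least_confidence, 2)
print(picked)
```

In a full active learning loop, the selected sentences would be labeled, moved into the training set, and the model retrained before the next round of scoring. With sparse labels, most tokens carry a confident "outside" tag, so sentence-level averages like these can be dominated by uninformative tokens, which is one plausible way the label sparsity mentioned in the abstract could interact with uncertainty scores.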
