Uncertainty-Informed Screening for Safer Solvents Used in the Synthesis of Perovskite via Language Models
Arpan Mukherjee, Deepesh Giri, Krishna Rajan

TL;DR
This paper introduces an uncertainty-informed framework combining language models for data extraction and deep learning for toxicity prediction to improve safety assessments of solvents in perovskite synthesis.
Contribution
It presents a novel approach integrating targeted language model data extraction with uncertainty quantification for toxicity prediction in solvents.
Findings
Automated data extraction improves data quality and reduces hallucinations.
Uncertainty quantification identifies data gaps and enhances prediction confidence.
Visualization reveals key solvent interactions related to hazards.
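The finding that uncertainty quantification identifies data gaps can be illustrated with a minimal sketch. This excerpt does not specify the uncertainty method the paper uses, so the example below assumes one common stand-in: disagreement across an ensemble of independently trained toxicity classifiers. The function name `summarize_ensemble`, the `std_threshold` value, and the solvent labels and probabilities are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def summarize_ensemble(predictions, std_threshold=0.15):
    """Summarize per-solvent ensemble predictions.

    predictions: dict mapping a solvent label to a list of toxicity
    probabilities, one per ensemble member (hypothetical setup, not
    the paper's stated method).
    std_threshold: assumed cutoff above which a prediction is treated
    as too uncertain to trust without more data.
    """
    summary = {}
    for solvent, probs in predictions.items():
        mu = mean(probs)        # ensemble-averaged toxicity probability
        sigma = pstdev(probs)   # spread across members as an uncertainty proxy
        summary[solvent] = {
            "mean": mu,
            "std": sigma,
            "uncertain": sigma > std_threshold,
        }
    return summary


# Made-up numbers for illustration only; not data from the paper.
example = {
    "solvent_A": [0.90, 0.88, 0.92, 0.91],  # members agree
    "solvent_B": [0.20, 0.75, 0.40, 0.95],  # members disagree
}
result = summarize_ensemble(example)
for name, stats in result.items():
    print(name, round(stats["mean"], 3), round(stats["std"], 3), stats["uncertain"])
```

In this sketch, solvents flagged as uncertain would be candidates for further data extraction or experimental follow-up, while confident predictions could feed directly into screening.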
Abstract
Accurately predicting the toxicity of industrial solvents used in perovskite synthesis is necessary but is limited by a lack of targeted, structured toxicity data. This paper presents a novel framework that combines automated data extraction using language models with an uncertainty-informed prediction model to fill data gaps and improve prediction confidence. First, we use and compare two approaches to automatically extract relevant data from a corpus of scientific literature on solvents used in perovskite synthesis: smaller bidirectional language models such as BERT and ELMo are used for their repeatability and deterministic outputs, while an autoregressive large language model (LLM) such as GPT-3.5 is used to leverage its larger training corpus and better response generation. Our novel 'prompting and verification' technique integrated with an…
Taxonomy
Topics: Multi-Criteria Decision Making
