LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature
Magdalena Lederbauer, Siddharth Betala, Xiyao Li, Ayush Jain, Amine Sehaba, Georgia Channing, Grégoire Germain, Anamaria Leonescu, Faris Flaifil, Alfonso Amayuelas, Alexandre Nozadze, Stefan P. Schmid, Mohd Zaki, Sudheesh Kumar Ethirajan, Elton Pan, Mathilde Franckel

TL;DR
LeMat-Synth is a multi-modal toolbox that uses large language models (LLMs) and vision language models (VLMs) to extract and structure synthesis procedures and performance data from the text and figures of materials science publications, in support of materials discovery.
Contribution
It introduces an LLM- and VLM-based approach for systematically extracting and organizing synthesis data from the literature, and contributes a structured dataset and open-source tools for the community.
Findings
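Curated 81k open-access papers into LeMat-Synth (v 1.0), a structured dataset of synthesis procedures spanning 35 synthesis methods and 16 material classes.
Evaluated extraction quality on a subset of 2.5k synthesis procedures using expert annotations.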
Provided an open-source software library for community extension.
Abstract
The development of synthesis procedures remains a fundamental challenge in materials discovery, with procedural knowledge scattered across decades of scientific literature in unstructured formats that are challenging for systematic analysis. In this paper, we propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data from materials science publications, covering text and figures. We curated 81k open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes, structured according to an ontology specific to materials science. The extraction quality is rigorously evaluated on a subset of 2.5k synthesis procedures through a combination of expert annotations and a scalable…
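To make the idea of an ontology-structured synthesis record more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the class names `SynthesisStep` and `SynthesisRecord`, their fields, the helper `group_by_method`, and the example values are hypothetical and are not taken from the LeMat-Synth ontology or its software library.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class SynthesisStep:
    """One action in a procedure. Hypothetical fields, not the paper's ontology."""
    action: str                             # e.g. "mix", "anneal"
    temperature_c: Optional[float] = None   # degrees Celsius, if reported
    duration_h: Optional[float] = None      # hours, if reported


@dataclass
class SynthesisRecord:
    """A structured synthesis procedure extracted from one paper (illustrative)."""
    material: str
    material_class: str
    synthesis_method: str
    steps: List[SynthesisStep] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into the nested SynthesisStep dataclasses.
        return json.dumps(asdict(self), sort_keys=True)


def group_by_method(records: List[SynthesisRecord]) -> Dict[str, int]:
    """Count how many records fall under each synthesis method."""
    counts: Dict[str, int] = {}
    for rec in records:
        counts[rec.synthesis_method] = counts.get(rec.synthesis_method, 0) + 1
    return counts


# Invented example values, used only to exercise the sketch.
record = SynthesisRecord(
    material="example oxide",
    material_class="oxide",
    synthesis_method="solid-state",
    steps=[
        SynthesisStep("mix"),
        SynthesisStep("anneal", temperature_c=700.0, duration_h=10.0),
    ],
)
```

Calling `record.to_json()` yields a flat JSON string that could be stored per paper, and `group_by_method` shows how records in this shape might be aggregated across synthesis methods.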
Taxonomy
Topics: Machine Learning in Materials Science · Catalysis and Oxidation Reactions · Inorganic Chemistry and Materials
