Structured information extraction from complex scientific text with fine-tuned large language models
Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, Anubhav Jain

TL;DR
This paper introduces a method based on fine-tuned GPT-3 for extracting structured scientific information from complex text. It reports accurate extraction across multiple chemistry-related tasks and offers an accessible approach to building large knowledge databases.
Contribution
The paper presents a simple, flexible sequence-to-sequence fine-tuning approach that uses GPT-3 for joint entity and relation extraction in scientific text and applies across multiple tasks.
Findings
Accurately extracts complex scientific data from unstructured text
Works across sentences and passages for comprehensive information gathering
Enables creation of large structured knowledge databases
Abstract
Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative…
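As a rough illustration of the prompt/completion setup the abstract describes, the sketch below builds one training pair whose completion is a list of JSON objects and then parses such a completion back into Python data. The separator strings, the `prompt`/`completion` JSONL layout, the example passage, and the field names `host` and `dopants` are all assumptions made for illustration; they are not taken from the paper.

```python
import json

# Hypothetical separator and stop strings; the source does not specify them.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "\n\nEND\n\n"


def make_training_record(passage, records):
    """Return one JSONL line pairing a passage with its structured records."""
    completion = " " + json.dumps(records) + COMPLETION_END
    return json.dumps({"prompt": passage + PROMPT_END, "completion": completion})


def parse_completion(text):
    """Recover the list of JSON objects from a model completion string."""
    body = text.split(COMPLETION_END)[0].strip()
    parsed = json.loads(body)
    if not isinstance(parsed, list):
        raise ValueError("expected a list of JSON objects")
    return parsed


# Illustrative example only; the passage and schema are made up.
passage = "ZnO nanoparticles were doped with Al to improve conductivity."
records = [{"host": "ZnO", "dopants": ["Al"]}]

line = make_training_record(passage, records)
roundtrip = parse_completion(json.loads(line)["completion"])
print(roundtrip)
```

With roughly 500 such lines written to a file, one line per annotated passage, the result would be a fine-tuning dataset of the size the abstract mentions. Using a fixed stop string makes it straightforward to trim the generated text before parsing it as JSON.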
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Cosine Annealing · Residual Connection · Linear Warmup With Cosine Annealing · Softmax
