MaTableGPT: GPT-based Table Data Extractor from Materials Science Literature
Gyeong Hoon Yi, Jiwoo Choi, Hyeongyun Song, Olivia Miano, Jaewoong Choi, Kihoon Bang, Byungju Lee, Seok Su Sohn, David Buttler, Anna Hiszpanski, Sang Soo Han, Donghun Kim

TL;DR
MaTableGPT is a GPT-based tool designed to accurately extract structured data from diverse materials science tables, enabling large-scale database creation and insightful analysis of water splitting catalysts.
Contribution
The paper introduces MaTableGPT, a novel GPT-based method with specialized table representation and filtering strategies for effective data extraction from complex scientific tables.
Findings
Achieved up to 96.8% F1 score in data extraction.
Few-shot learning balances accuracy and cost effectively.
Generated valuable insights into catalyst properties.
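For readers unfamiliar with the metric quoted above, the following is a minimal sketch of the standard F1 score, the harmonic mean of precision and recall. It assumes extraction results are tallied as true positives, false positives, and false negatives; how the paper counts these and how it aggregates its "total F1 score" is not stated in this summary, so the function name and inputs here are illustrative assumptions only.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Assumption: counts are tallied per extracted value; the paper's own
    counting and aggregation scheme may differ.
    Returns 0.0 when precision or recall is undefined or both are zero.
    """
    predicted = true_positives + false_positives  # everything the extractor emitted
    actual = true_positives + false_negatives     # everything it should have emitted
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because F1 is a harmonic mean, it is high only when both precision and recall are high, so an extractor cannot score well by emitting very few values or by emitting many spurious ones.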
Abstract
Efficiently extracting data from tables in the scientific literature is pivotal for building large-scale databases. However, the tables reported in materials science papers exist in highly diverse forms; thus, rule-based extractions are an ineffective approach. To overcome this challenge, we present MaTableGPT, which is a GPT-based table data extractor from the materials science literature. MaTableGPT features key strategies of table data representation and table splitting for better GPT comprehension and filtering hallucinated information through follow-up questions. When applied to a vast volume of water splitting catalysis literature, MaTableGPT achieved an extraction accuracy (total F1 score) of up to 96.8%. Through comprehensive evaluations of the GPT usage cost, labeling cost, and extraction accuracy for the learning methods of zero-shot, few-shot and fine-tuning, we present a…
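The abstract mentions table splitting as a way to make tables easier for GPT to comprehend. As a rough illustration of the general idea, the sketch below splits a table into one record per data row, each carrying the full header, so that every piece can be read on its own. This is an assumption about what such a step could look like, not the authors' implementation; the function name `split_table` and the sample column names and values are hypothetical placeholders.

```python
from typing import Dict, List


def split_table(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Split a table into one self-describing record per data row.

    Hypothetical helper for illustration: each record repeats the header
    so it can be interpreted without the rest of the table.
    """
    records = []
    for index, row in enumerate(rows):
        if len(row) != len(header):
            raise ValueError(
                f"row {index} has {len(row)} cells, expected {len(header)}"
            )
        records.append(dict(zip(header, row)))
    return records


# Placeholder table; the values are made up and carry no scientific meaning.
example_header = ["Catalyst", "Overpotential (mV)"]
example_rows = [["catalyst A", "300"], ["catalyst B", "250"]]
example_records = split_table(example_header, example_rows)
```

Each resulting record could then be serialized, for example with `json.dumps`, and passed to the model separately. Real tables in the literature often have merged cells, multi-level headers, and footnotes, which this sketch does not handle.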
Taxonomy
Topics: Image Processing and 3D Reconstruction · Handwritten Text Recognition Techniques · Data Quality and Management
Methods: Attention Is All You Need · Byte Pair Encoding · Adam · Attention Dropout · Linear Layer · Multi-Head Attention · Dropout · Dense Connections · Cosine Annealing
