Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)
Niclas Griesshaber, Jochen Streb

TL;DR
This paper presents tentative evidence that multimodal large language models can construct large historical patent datasets from archival image scans at higher quality, greater speed, and lower cost than manual transcription by research assistants.
Contribution
The study introduces an LLM-based pipeline for extracting historical patent data from archival image scans, benchmarks it against research assistants, and open-sources the pipeline along with the benchmarking and patent datasets.
Findings
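The benchmark provides tentative evidence that multimodal LLMs produce higher quality datasets than the research assistants.
The pipeline is more than 795 times faster and 205 times cheaper than the research assistants at constructing the patent dataset.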
Open-source datasets and tools facilitate adoption by researchers.
Abstract
We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper in constructing the patent dataset from our image corpus. About 20 to 50 patent entries are embedded on each page, arranged in a double-column format and printed in Gothic and Roman fonts. The font and layout complexity of our primary source material suggests to us that multimodal LLMs are a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be…
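As a quick sanity check on the figures quoted in the abstract, the sketch below divides the reported patent count by the number of image scans and compares the result with the stated 20 to 50 entries per page. The constant and function names are illustrative and are not taken from the authors' pipeline; the calculation assumes patent entries are spread across all 9,562 scans, which the abstract does not state explicitly.

```python
# Figures as reported in the abstract.
TOTAL_PATENTS = 306_070
TOTAL_SCANS = 9_562
ENTRIES_PER_PAGE_RANGE = (20, 50)  # "about 20 to 50 patent entries" per page


def mean_entries_per_scan(patents: int, scans: int) -> float:
    """Average number of patent entries per image scan."""
    return patents / scans


# Assumption: every scan contributes patent entries to the final dataset.
mean_entries = mean_entries_per_scan(TOTAL_PATENTS, TOTAL_SCANS)
low, high = ENTRIES_PER_PAGE_RANGE

print(f"mean entries per scan: {mean_entries:.1f}")
print(f"within stated range: {low <= mean_entries <= high}")
```

The average comes out at roughly 32 entries per scan, which sits inside the range the abstract gives for a single page.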
Taxonomy
Topics: Intellectual Property and Patents · Computational and Text Analysis Methods · Machine Learning in Materials Science
