# Utilizing large language models to construct a dataset of Württemberg’s 19th-century fauna from historical records

**Authors:** Maximilian C. Teich, Belen Escobari, Malte Rehbein

PMC · DOI: 10.1371/journal.pone.0344181 · PLOS One · 2026-03-24

## TL;DR

This paper shows that prompted large language models can automatically detect species mentions in 19th-century texts from Württemberg and link them to GBIF identifiers, reducing the manual effort needed to build historical biodiversity datasets.

## Contribution

The paper introduces a novel use of prompted large language models: detecting species mentions in historical texts and linking them to GBIF identifiers.

## Key findings

- LLMs achieved high recall (92.6%) and precision (95.3%) in identifying species mentions in historical texts.
- Detected species mentions were linked to the correct GBIF identifier with 83.0% accuracy.
- The approach is scalable and adaptable to other historical contexts and languages.
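
For readers unfamiliar with the metrics above, the following is a minimal sketch of how precision and recall are computed for a species-detection task. It assumes a simplified set-based comparison of predicted and gold-standard names; the function name `precision_recall` and the toy data are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for predicted vs. gold species mentions."""
    true_positives = len(predicted & gold)
    # Precision: share of predicted mentions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold mentions that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data invented for illustration only (not the paper's results).
gold = {"Sus scrofa", "Lepus europaeus", "Ursus arctos", "Falco tinnunculus"}
predicted = {"Sus scrofa", "Lepus europaeus", "Ursus arctos", "Homo sapiens"}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the four predictions are correct and three of the four gold mentions are found, so both values are 0.75.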

## Abstract

Constructing datasets on past biodiversity from historical sources is crucial for understanding long-term ecological changes. Typically, compiling such datasets relies on prior knowledge of the sources’ composition and requires considerable manual effort. To overcome these challenges, we implement an automated approach based on prompted large language models (LLMs) to detect mentions of species in texts from 19th-century Württemberg and link these mentions to identifiers in the GBIF database. Based on our evaluation, we find that LLMs can reliably identify species in the texts with high recall (92.6%) and precision (95.3%), while providing estimates of the correct species identifier with considerable accuracy (83.0%). As our approach is easily scalable and adaptable to other contexts and languages, it offers a promising way to advance dataset generation from historical material using limited resources.
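
To make the linking step concrete, here is a rough sketch of how a detected species name could be looked up against GBIF's public species-match endpoint. This is an assumption for illustration, not the authors' method: the abstract describes the LLM itself providing estimates of the identifier, whereas this sketch performs a plain name lookup. The helper names (`build_match_url`, `extract_identifier`, `link_mention`) and the default `kingdom` filter are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# GBIF's public name-matching endpoint (v1 API).
GBIF_MATCH_ENDPOINT = "https://api.gbif.org/v1/species/match"


def build_match_url(name: str, kingdom: str = "Animalia") -> str:
    """Build the query URL for a scientific name (kingdom filter is an assumption)."""
    return f"{GBIF_MATCH_ENDPOINT}?{urlencode({'name': name, 'kingdom': kingdom})}"


def extract_identifier(match: dict) -> Optional[int]:
    """Return the GBIF usage key from a match response, or None if nothing matched."""
    if match.get("matchType", "NONE") == "NONE":
        return None
    return match.get("usageKey")


def link_mention(name: str) -> Optional[int]:
    """Query GBIF for a name and return its usage key (requires network access)."""
    with urlopen(build_match_url(name), timeout=10) as response:
        return extract_identifier(json.load(response))
```

Calling `link_mention("Sus scrofa")` would issue one HTTP request. Historical synonyms such as *Barbus fluviatilis* may resolve differently from modern accepted names, so any returned key should be checked against the match type and status fields in the response.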

## Full-text entities

- **Diseases:** LLMs (MESH:D007806), Salamandra maculosa (MESH:C563349), hallucinations (MESH:D006212)
- **Chemicals:** GPT-4o (-)
- **Species:** Philaenus spumarius (meadow spittlebug, species) [taxon 36667], Haliclona sp. ARE (species) [taxon 1804645], Falco tinnunculus (common kestrel, species) [taxon 100819], Suidae (boars, family) [taxon 9821], Sus scrofa (pig, species) [taxon 9823], Lepus europaeus (European hare, species) [taxon 9983], Ursus arctos (brown bear, species) [taxon 9644], Barbus fluviatilis (barbel, species) [taxon 98391], Lepus (hares, genus) [taxon 9980], Nicrophorus vespillo (species) [taxon 483353], Lyrurus tetrix (black grouse, species) [taxon 1233216], Capreolus capreolus (Western roe deer, species) [taxon 9858], Salamandra salamandra (European fire salamander, species) [taxon 57571], Carduelis carduelis (Eurasian goldfinch, species) [taxon 37600], Homo sapiens (human, species) [taxon 9606], Anser fabalis (Bean goose, species) [taxon 132587], Barbus barbus (barbel, species) [taxon 40830]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13012729/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13012729/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/PMC13012729/full.md

---
Source: https://tomesphere.com/paper/PMC13012729