Application of Computer Vision to the Automated Extraction of Metadata from Natural History Specimen Labels: A Case Study on Herbarium Specimens

Jacopo Zacchigna; Weiwei Liu; Felice Andrea Pellegrino; Adriano Peron; Francesco Roma-Marzio; Lorenzo Peruzzi; Stefano Martellos

PMC · DOI: 10.3390/plants15040637 · February 17, 2026

Open Access

TL;DR

This paper presents an automated system using computer vision to extract metadata from herbarium specimen labels, improving efficiency and accuracy over traditional OCR methods.

Contribution

A novel end-to-end solution that fine-tunes a pre-trained multimodal Transformer to extract metadata from herbarium specimen labels and map them to Darwin Core concepts, without image preprocessing or additional manual labeling for training.

Findings

01

The system achieved 85% accuracy, measured with a Tree Edit Distance metric, on a test dataset from the University of Pisa.

02

Multiple labels that mix handwritten and typewritten text posed the greatest challenge for the model.

03

The approach offers flexibility for reuse and adaptation as newer foundational models become available.
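To make the Tree Edit Distance figure in the first finding more concrete, the following is a minimal sketch of how such an accuracy could be computed for a single label. It is not the authors' evaluation code. It assumes that each extracted record is a flat mapping from Darwin Core terms to strings, that the record is turned into a small ordered tree (root, one node per term, one leaf per value), and that the score is normalized as max(0, 1 - TED(pred, gt) / TED(empty, gt)), a normalization commonly used for document-parsing models. The helper names (`to_tree`, `forest_distance`, `ted_accuracy`) and the sample values are hypothetical.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.

def to_tree(record):
    """Turn a flat {term: value} record into an ordered tree (assumed layout)."""
    # Keys are sorted so that field order does not affect the score.
    children = tuple((key, ((str(value), ()),))
                     for key, value in sorted(record.items()))
    return ("record", children)  # the root label is arbitrary

def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        return sum(size(t) for t in g)   # insert every node of g
    if not g:
        return sum(size(t) for t in f)   # delete every node of f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the root of the rightmost tree in f (its children move up).
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the root of the rightmost tree in g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, relabeling if their labels differ.
    match = (forest_distance(f[:-1], g[:-1])
             + forest_distance(v_children, w_children)
             + int(v_label != w_label))
    return min(delete, insert, match)

def ted_accuracy(pred, gt):
    """Normalized accuracy: max(0, 1 - TED(pred, gt) / TED(empty, gt))."""
    pred_tree, gt_tree, empty = to_tree(pred), to_tree(gt), to_tree({})
    denom = forest_distance((empty,), (gt_tree,))
    if denom == 0:  # empty ground truth: only an empty prediction is correct
        return 1.0 if pred_tree == gt_tree else 0.0
    return max(0.0, 1 - forest_distance((pred_tree,), (gt_tree,)) / denom)

# Illustrative values only; not taken from the paper's dataset.
gt = {"scientificName": "Potentilla speciosa", "country": "Italy"}
print(ted_accuracy(gt, gt))
print(ted_accuracy({"scientificName": "Potentilla speciosa", "country": "Itali"}, gt))
```

In this sketch a misread value counts as a single node relabel, so partial string matches earn no credit. A production metric might refine the relabel cost, for example with a character-level distance between values.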

Abstract

Extracting metadata from natural history collection labels is pivotal for the online publication of digitized specimens. Building on a pre-trained multimodal Transformer, we developed an end-to-end automated solution to extract metadata from digitally imaged herbarium specimen labels and map them to Darwin Core standard concepts. A second objective was to demonstrate the feasibility of applying state-of-the-art AI techniques to biodiversity data through a real-world use case that does not require image preprocessing or additional manual labeling for training. The proposed solution does not rely on closed-source services, is fine-tuned in-house, and can be used offline and locally. It can be flexibly reused by developers to extract metadata across different herbarium collections. Furthermore, its encoder and/or decoder component can be replaced to take advantage of newer foundational…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques