Hespi: A pipeline for automatically detecting information from herbarium specimen sheets
Robert Turnbull, Emily Fitzgerald, Karen Thompson, Joanne L. Birch

TL;DR
Hespi is an automated pipeline that uses computer vision and language models to efficiently extract biodiversity data from herbarium specimen sheets, reducing reliance on manual transcription.
Contribution
The paper introduces Hespi, a novel modular pipeline combining object detection, OCR, HTR, and LLMs for automated data extraction from herbarium specimens.
Findings
High accuracy in detecting and extracting text from specimen sheets.
Effective classification of label types (printed, typed, handwritten, mixed).
Modular design allows customization and training of new models.
Abstract
Specimen-associated biodiversity data are crucial for biological, environmental, and conservation sciences. A rate shift is needed to extract data from specimen images efficiently, moving beyond human-mediated transcription. We developed 'Hespi' (HErbarium Specimen sheet PIpeline) using advanced computer vision techniques to extract pre-catalogue data from primary specimen labels on herbarium specimens. Hespi integrates two object detection models: one for detecting the components of the sheet and another for fields on the primary specimen label. It classifies labels as printed, typed, handwritten, or mixed and uses Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) for extraction. The text is then corrected against authoritative taxon databases and refined using a multimodal Large Language Model (LLM). Hespi accurately detects and extracts text from…
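The step of correcting recognised text against an authoritative taxon database can be illustrated with a minimal sketch. This is not Hespi's implementation: the function name `correct_against_reference`, the tiny `REFERENCE_GENERA` list, and the 0.8 similarity cutoff are all assumptions chosen for illustration, and the matching uses Python's standard-library `difflib` rather than whatever method the authors use.

```python
from difflib import get_close_matches

# Hypothetical reference list. A real pipeline would load names from an
# authoritative taxon database instead of hard-coding them.
REFERENCE_GENERA = ["Eucalyptus", "Acacia", "Banksia", "Grevillea", "Melaleuca"]


def correct_against_reference(recognised, reference, cutoff=0.8):
    """Return the closest reference name, or the input unchanged if none is close enough.

    `cutoff` is an assumed similarity threshold between 0 and 1.
    """
    # Compare case-insensitively, but return the canonical spelling.
    by_lower = {name.lower(): name for name in reference}
    matches = get_close_matches(recognised.lower(), list(by_lower), n=1, cutoff=cutoff)
    return by_lower[matches[0]] if matches else recognised


# An OCR-style misreading is mapped to the nearest reference name.
print(correct_against_reference("Eucalyptvs", REFERENCE_GENERA))
# Text with no close match is left as recognised rather than forced to a name.
print(correct_against_reference("Xyz", REFERENCE_GENERA))
```

Returning the input unchanged when nothing clears the cutoff keeps the original reading available for later review instead of silently replacing it with a poor match.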
Taxonomy
Topics: Image Processing and 3D Reconstruction · Mathematics, Computing, and Information Processing · Research Data Management Practices
