ARETE: an R package for Automated REtrieval from TExt with large language models
Vasco V. Branco, Jandó Benedek, Lidia Pivovarova, Luís Correia, Pedro Cardoso

TL;DR
The ARETE R package automates species occurrence data extraction from various sources using large language models, significantly expanding data availability for conservation and ecological research.
Contribution
It introduces an open-source R package that integrates OCR, data validation, and LLM-based extraction, validated against human annotation, to streamline species data collection.
Findings
Expanded species range data by three orders of magnitude.
Automated extraction outperforms manual efforts in speed and scale.
Potential to improve conservation planning and extinction risk assessments.
Abstract
1. A hard stop for the implementation of rigorous conservation initiatives is our lack of key species data, especially occurrence data. Furthermore, researchers have to contend with an accelerated speed at which new information must be collected and processed due to anthropogenic activity. Publications ranging from scientific papers to gray literature contain this crucial information but their data are often not machine-readable, requiring extensive human work to be retrieved.
2. We present the ARETE R package, an open-source software aiming to automate data extraction of species occurrences powered by large language models, namely using the chatGPT Application Programming Interface. This R package integrates all steps of the data extraction and validation process, from Optical Character Recognition to detection of outliers and output in tabular format. Furthermore, we validate ARETE…
Taxonomy
Topics: Species Distribution and Climate Change · Animal and Plant Science Education · Environmental DNA in Biodiversity Studies
