# Spanish-language text classification for environmental evidence synthesis using multilingual pre-trained models

**Authors:** Violeta Berdejo-Espinola, Ákos Hajas, Richard Cornford, Nan Ye, Tatsuya Amano

PMC · DOI: 10.1186/s13750-025-00370-9 · Environmental Evidence · 2025-11-12

## TL;DR

This paper shows how AI can help include Spanish-language research in environmental evidence syntheses, reducing bias and manual work.

## Contribution

A novel approach that combines pre-trained multilingual models with class weights to build a Spanish-language text classifier for evidence screening.

## Key findings

- The best model achieved 100% recall: none of the relevant Spanish-language papers (n = 9) were missed.
- Over 70% (n = 867) of irrelevant Spanish-language documents were filtered out using only titles and abstracts.
- The method uses a small labeled Spanish corpus and handles highly imbalanced data effectively.
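The two headline metrics above (recall, and the share of irrelevant documents filtered out) can be computed from confusion-matrix counts. The sketch below is illustrative only: the function name `screening_metrics` is hypothetical, and the counts for irrelevant documents are made-up values chosen for round numbers, not the paper's confusion matrix.

```python
import math


def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return recall and the share of negatives filtered out.

    tp/fn: relevant documents kept / missed by the classifier.
    tn/fp: irrelevant documents filtered out / passed on to manual screening.
    """
    # Recall: fraction of relevant documents the classifier kept.
    recall = tp / (tp + fn) if (tp + fn) else math.nan
    # Share of irrelevant documents removed (true negative rate).
    negatives_filtered = tn / (tn + fp) if (tn + fp) else math.nan
    return {"recall": recall, "negatives_filtered": negatives_filtered}


# Illustrative counts (assumed, not from the paper): all 9 relevant
# documents are kept, and 750 of 1,000 irrelevant ones are filtered out.
metrics = screening_metrics(tp=9, fn=0, tn=750, fp=250)
print(metrics)
```

In a screening setting recall is usually prioritized, because a missed relevant paper cannot be recovered later, whereas an irrelevant paper that slips through only costs some manual screening time.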

## Abstract

Artificial intelligence (AI) is increasingly being explored as a tool to optimize and accelerate various stages of evidence synthesis. A persistent challenge in environmental evidence syntheses is that they remain predominantly monolingual (English), which biases results and can misinform cross-scale policy decisions. AI offers a promising opportunity to incorporate non-English-language evidence into the screening process of evidence syntheses and to help move beyond their current monolingual focus. Using a corpus of Spanish-language peer-reviewed papers on biodiversity conservation interventions, we developed and evaluated text classifiers using supervised machine learning models. Our best-performing model achieved 100% recall, meaning no relevant papers (n = 9) were missed, and filtered out over 70% (n = 867) of negative documents based only on the title and abstract of each paper. The text was encoded using a pre-trained multilingual model, and class weights were used to deal with a highly imbalanced dataset (0.79%). This research therefore offers an approach to reducing the manual, time-intensive effort required for document screening in evidence syntheses, with minimal risk of missing relevant studies. It highlights the potential of multilingual large language models and class weights to train a lightweight non-English-language classifier that can effectively filter irrelevant texts using only a small labelled non-English-language corpus. Future work could build on our approach to develop a multilingual classifier that enables the inclusion of any non-English scientific literature in evidence syntheses.
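To make the class-weighting idea concrete, here is a minimal sketch of one common scheme, inverse-frequency ("balanced") weights, where each class gets weight `n_samples / (n_classes * class_count)`. The abstract does not say which weighting scheme the authors used, so this choice is an assumption, as are the function name `balanced_class_weights` and the 10,000-document corpus. Reading the 0.79% figure as the share of relevant documents is also an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical corpus: 10,000 documents, 79 relevant (0.79%), 9,921 irrelevant.
labels = [1] * 79 + [0] * 9921
weights = balanced_class_weights(labels)

# The rare positive class gets a much larger weight, so each misclassified
# relevant paper contributes far more to a weighted training loss.
print(weights)
```

With these weights, the total weight of each class is equal (here 5,000 per class), so the few relevant documents are not swamped by the irrelevant majority during training.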

The online version contains supplementary material available at 10.1186/s13750-025-00370-9.

## Full-text entities

- **Diseases:** XAI (MESH:C538243), fire (MESH:D000092422)
- **Species:** Phrynosomatidae (family) [taxon 2024743], Homo sapiens (human, species) [taxon 9606], Bacillus sp. AT (species) [taxon 1196779]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12613578/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12613578/full.md

## References

13 references — full list in the complete paper: https://tomesphere.com/paper/PMC12613578/full.md

---
Source: https://tomesphere.com/paper/PMC12613578