Efficient Scientific Full Text Classification: The Case of EICAT Impact Assessments
Marc Felix Brinner, Sina Zarrieß

TL;DR
This paper presents methods for efficient scientific full text classification using small BERT models and large language models, focusing on sentence selection strategies to reduce input size and improve accuracy in impact assessments.
Contribution
It introduces a novel dataset and demonstrates that sentence selection and repeated sampling enhance classification performance and efficiency over full-text models.
Findings
Sentence selection improves model accuracy and efficiency.
Repeated sampling of shorter inputs further boosts performance.
Models trained on selected sentences outperform full-text models.
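The two ideas in the findings above, sentence selection and repeated sampling of shorter inputs, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `pool_size` parameter, the uniform sampling from a top-scoring pool, and the majority vote are all assumptions made for illustration. The per-sentence relevance scores and the `classify` callable stand in for a trained sentence selection model and a trained classifier.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence


def select_top_k(sentences: Sequence[str], scores: Sequence[float], k: int) -> List[str]:
    """Keep the k highest-scoring sentences, returned in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


def classify_by_repeated_sampling(
    sentences: Sequence[str],
    scores: Sequence[float],
    classify: Callable[[List[str]], str],  # hypothetical classifier: sentences -> label
    k: int,
    pool_size: int,
    n_samples: int,
    seed: int = 0,
) -> str:
    """Classify several short random subsets and return the majority label.

    Assumed scheme: draw k sentences uniformly from the pool_size
    best-scoring sentences, classify each draw, then take a majority vote.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    rng = random.Random(seed)
    pool = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:pool_size]
    if not 0 < k <= len(pool):
        raise ValueError("k must be between 1 and the size of the candidate pool")
    votes: Counter = Counter()
    for _ in range(n_samples):
        chosen = sorted(rng.sample(pool, k))  # restore document order
        votes[classify([sentences[i] for i in chosen])] += 1
    return votes.most_common(1)[0][0]
```

Each sampled input contains only `k` sentences, so every classifier call stays short, while the vote across draws lets more of the document contribute to the final label.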
Abstract
This study explores strategies for efficiently classifying scientific full texts using both small, BERT-based models and local large language models like Llama-3.1 8B. We focus on developing methods for selecting subsets of input sentences to reduce input size while simultaneously enhancing classification performance. To this end, we compile a novel dataset consisting of full-text scientific papers from the field of invasion biology, specifically addressing the impacts of invasive species. These papers are aligned with publicly available impact assessments created by researchers for the International Union for Conservation of Nature (IUCN). Through extensive experimentation, we demonstrate that various sources like human evidence annotations, LLM-generated annotations or explainability scores can be used to train sentence selection models that improve the performance of both encoder-…
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques
