A Library Perspective on Supervised Text Processing in Digital Libraries: An Investigation in the Biomedical Domain

Hermann Kroll; Pascal Sackhoff; Bill Matthias Thang; Maha Ksouri; Wolf-Tilo Balke

arXiv:2411.12752·cs.DL·November 21, 2024

TL;DR

This paper explores practical supervised text processing workflows in digital libraries, emphasizing trade-offs between accuracy and costs, and investigates data generation methods using large language models in the biomedical domain.

Contribution

It introduces a practical perspective on supervised text processing in digital libraries, focusing on data creation, model training, and pipeline design tailored for biomedical applications.

Findings

01

Trade-offs between accuracy and application costs are analyzed.

02

Distant supervision and large language models are evaluated for data generation.

03

Eight biomedical benchmarks are used to demonstrate the approaches.
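To make the distant supervision idea mentioned in the findings concrete, here is a minimal Python sketch. It is an illustration only, not the authors' pipeline: the tiny knowledge base `KNOWN_TREATS`, the entity vocabularies `DRUGS` and `DISEASES`, and the function `distant_labels` are all hypothetical names invented for this example. The sketch labels a sentence as a positive training example when it mentions an entity pair that a knowledge base already lists as related, and as a negative example otherwise.

```python
import re

# Hypothetical knowledge base of known (drug, disease) "treats" pairs.
KNOWN_TREATS = {("metformin", "diabetes"), ("aspirin", "headache")}

# Hypothetical entity vocabularies used for naive dictionary matching.
DRUGS = {"metformin", "aspirin"}
DISEASES = {"diabetes", "headache"}


def distant_labels(sentence):
    """Return (drug, disease, label) for each drug/disease pair co-occurring in the sentence.

    label is 1 if the pair is listed in KNOWN_TREATS, else 0.
    """
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    examples = []
    for drug in sorted(DRUGS & tokens):
        for disease in sorted(DISEASES & tokens):
            label = 1 if (drug, disease) in KNOWN_TREATS else 0
            examples.append((drug, disease, label))
    return examples


if __name__ == "__main__":
    print(distant_labels("Metformin is commonly prescribed for diabetes."))
    print(distant_labels("Aspirin did not improve diabetes outcomes."))
```

Labels produced this way are noisy by construction: a sentence that merely mentions a known pair is marked positive even if it does not express the relation, which is one reason such data trades annotation cost against accuracy.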

Abstract

Digital libraries that maintain extensive textual collections may want to further enrich their content for certain downstream applications, e.g., building knowledge graphs, semantic enrichment of documents, or implementing novel access paths. All of these applications require some text processing, whether to identify relevant entities, to extract semantic relationships between them, or to classify documents into categories. However, implementing reliable, supervised workflows can become quite challenging for a digital library because suitable training data must be crafted, and reliable models must be trained. While many works focus on achieving the highest accuracy on some benchmarks, we tackle the problem from the perspective of a digital library practitioner. In other words, we also consider trade-offs between accuracy and application costs, dive into training data generation through distant…

Code & Models

Repositories

hermannkroll/supervisedtextprocessing
Official repository
