TL;DR
This paper explores practical supervised text processing workflows in digital libraries, emphasizing trade-offs between accuracy and costs, and investigates data generation methods using large language models in the biomedical domain.
Contribution
It introduces a practical perspective on supervised text processing in digital libraries, focusing on data creation, model training, and pipeline design tailored for biomedical applications.
Findings
Trade-offs between accuracy and application costs are analyzed.
Distant supervision and large language models are evaluated for data generation.
Eight biomedical benchmarks are used to demonstrate the approaches.
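As a rough illustration of what distant supervision for data generation can look like, the sketch below labels entity mentions by matching sentences against a vocabulary. It is a minimal example under stated assumptions, not the paper's pipeline: the `VOCABULARY` entries, the entity types, and the `distant_labels` function are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical toy vocabulary (assumption); a digital library would more
# likely derive this from a curated thesaurus or knowledge base.
VOCABULARY = {
    "aspirin": "Drug",
    "metformin": "Drug",
    "diabetes": "Disease",
    "headache": "Disease",
}


def distant_labels(sentence, vocabulary):
    """Return (start, end, surface form, entity type) for each vocabulary match."""
    spans = []
    for term, entity_type in vocabulary.items():
        # Match whole words only, ignoring case.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), match.group(0), entity_type))
    # Sort by start offset so labels follow reading order.
    return sorted(spans)


if __name__ == "__main__":
    for span in distant_labels("Metformin is used to treat diabetes.", VOCABULARY):
        print(span)
```

Labels produced this way are noisy (ambiguous terms and missing synonyms both cause errors), which is one reason the accuracy versus cost trade-off matters when deciding how training data is created.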
Abstract
Digital libraries that maintain extensive textual collections may want to further enrich their content for certain downstream applications, e.g., building knowledge graphs, semantic enrichment of documents, or implementing novel access paths. All of these applications require some text processing, either to identify relevant entities, extract semantic relationships between them, or to classify documents into some categories. However, implementing reliable, supervised workflows can become quite challenging for a digital library because suitable training data must be crafted, and reliable models must be trained. While many works focus on achieving the highest accuracy on some benchmarks, we tackle the problem from the perspective of a digital library practitioner. In other words, we also consider trade-offs between accuracy and application costs, dive into training data generation through distant…
Taxonomy
MethodsLib · Focus
