Pre-training data selection for biomedical domain adaptation using journal impact metrics

Mathieu Laï-king; Patrick Paroubek

arXiv:2409.02725·cs.CL·September 5, 2024

TL;DR

This study investigates whether selecting biomedical pre-training data based on journal impact metrics improves language model performance, finding that impact-based pruning is ineffective but reducing data size does not harm performance.

Contribution

The paper introduces a simple impact metric-based data selection approach for biomedical domain adaptation and evaluates its effectiveness compared to using full datasets.

Findings

01

Impact-based pruning is not effective for data selection.

02

Using fewer abstracts with the same training steps maintains performance.

03

Pre-training data size reduction does not necessarily decrease model quality.
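The second finding rests on simple arithmetic: with a fixed number of training steps, a smaller corpus is seen more times. A minimal sketch of that relationship follows. The function name and the example numbers are illustrative assumptions, not values from the paper.

```python
def passes_over_data(num_examples: int, batch_size: int, num_steps: int) -> float:
    """Return how many times a corpus is traversed under a fixed step budget.

    All arguments are hypothetical; the paper's actual batch size and
    step count are not given in this summary.
    """
    if num_examples <= 0 or batch_size <= 0 or num_steps < 0:
        raise ValueError("num_examples and batch_size must be positive, num_steps non-negative")
    # Each step consumes one batch, so total examples seen = steps * batch size.
    return num_steps * batch_size / num_examples


if __name__ == "__main__":
    # Illustrative numbers only: same step budget, full corpus vs. a quarter of it.
    full = passes_over_data(num_examples=1000, batch_size=10, num_steps=100)
    quarter = passes_over_data(num_examples=250, batch_size=10, num_steps=100)
    print(f"full corpus: {full} passes, quarter corpus: {quarter} passes")
```

Under this view, shrinking the corpus while keeping the step count trades data diversity for repetition, which is the trade-off the findings describe as not harming performance.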

Abstract

Domain adaptation is a widely used method in natural language processing (NLP) to improve the performance of a language model within a specific domain. This method is particularly common in the biomedical domain, which sees regular publication of numerous scientific articles. PubMed, a significant corpus of text, is frequently used in the biomedical domain. The primary objective of this study is to explore whether refining a pre-training dataset using specific quality metrics for scientific papers can enhance the performance of the resulting model. To accomplish this, we employ two straightforward journal impact metrics and conduct experiments by continually pre-training BERT on various subsets of the complete PubMed training set. We then evaluate the resulting models on biomedical language understanding tasks from the BLURB benchmark. Our results show that pruning using journal impact…
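One way to picture impact-based pruning is as a rank-and-truncate step over abstracts, keyed by a per-journal score. The sketch below is an assumption-laden illustration, not the authors' implementation: the `Abstract` record, the `journal_impact` mapping, and the `keep_fraction` parameter are all hypothetical, and the summary does not name the two metrics the paper uses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Abstract:
    """Hypothetical record for one abstract."""
    pmid: str
    journal: str
    text: str


def prune_by_impact(
    abstracts: List[Abstract],
    journal_impact: Dict[str, float],
    keep_fraction: float,
) -> List[Abstract]:
    """Keep the top `keep_fraction` of abstracts, ranked by journal score.

    Abstracts whose journal has no score are dropped (an assumption;
    other policies, such as assigning a default score, are possible).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    scored = [a for a in abstracts if a.journal in journal_impact]
    # Stable sort: abstracts from equally ranked journals keep their input order.
    scored.sort(key=lambda a: journal_impact[a.journal], reverse=True)
    n_keep = max(1, int(len(scored) * keep_fraction)) if scored else 0
    return scored[:n_keep]


if __name__ == "__main__":
    # Toy data with made-up scores, for illustration only.
    corpus = [
        Abstract("1", "J-low", "text a"),
        Abstract("2", "J-high", "text b"),
        Abstract("3", "J-mid", "text c"),
        Abstract("4", "J-high", "text d"),
    ]
    impact = {"J-high": 10.0, "J-mid": 5.0, "J-low": 1.0}
    kept = prune_by_impact(corpus, impact, keep_fraction=0.5)
    print([a.pmid for a in kept])
```

The retained subset would then serve as the corpus for continued pre-training, with a random subset of the same size as a natural baseline for comparison.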

Taxonomy

Topics: Biomedical Text Mining and Ontologies

Methods: Attention Is All You Need · Softmax · Dropout · Layer Normalization · Linear Layer · Adam · Weight Decay · Dense Connections · WordPiece