Harnessing Large Language Models for Biomedical Named Entity Recognition
Jian Chen, Leilei Su, Cong Sun

TL;DR
This paper presents BioSelectTune, a data-centric fine-tuning framework for biomedical named entity recognition that uses high-quality curated data and reformulates the task as JSON generation, achieving state-of-the-art results with less data.
Contribution
Introduces BioSelectTune, a novel data curation and fine-tuning approach that improves BioNER performance while using less training data and outperforms existing domain-specific models.
Findings
Achieves state-of-the-art BioNER performance on multiple benchmarks.
Surpasses fully trained baselines when trained on only 50% of the curated data.
Outperforms domain-specific models such as BioMedBERT.
Abstract
Background and Objective: Biomedical Named Entity Recognition (BioNER) is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching. However, adapting general-domain Large Language Models (LLMs) to this task is often hampered by their lack of domain-specific knowledge and the performance degradation caused by low-quality training data. To address these challenges, we introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning LLMs that prioritizes data quality over quantity. Methods and Results: BioSelectTune reformulates BioNER as a structured JSON generation task and leverages our novel Hybrid Superfiltering strategy, a weak-to-strong data curation method that uses a homologous weak model to distill a compact, high-impact training dataset. Conclusions: Through extensive experiments, we…
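To make the "BioNER as structured JSON generation" idea concrete, here is a minimal sketch of what a training target and a prediction parser might look like. The abstract does not give the actual schema, so the `entities`, `text`, and `type` keys, the function names, and the example sentence are all assumptions chosen for illustration, not the authors' format.

```python
import json

# Hypothetical schema: {"entities": [{"text": ..., "type": ...}, ...]}.
# The paper's real output format is not given in the abstract.

def entities_to_json_target(entities):
    """Serialize gold (text, type) pairs as the JSON string a model would be trained to emit."""
    payload = {"entities": [{"text": text, "type": etype} for text, etype in entities]}
    return json.dumps(payload, ensure_ascii=False)

def parse_json_prediction(raw):
    """Parse a generated string into (text, type) pairs; return [] on malformed output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    items = data.get("entities", [])
    if not isinstance(items, list):
        return []
    pairs = []
    for item in items:
        # Keep only well-formed entries; silently skip anything else.
        if (
            isinstance(item, dict)
            and isinstance(item.get("text"), str)
            and isinstance(item.get("type"), str)
        ):
            pairs.append((item["text"], item["type"]))
    return pairs

# Illustrative example (made-up sentence and labels).
sentence = "Mutations in BRCA1 increase the risk of breast cancer."
gold = [("BRCA1", "Gene"), ("breast cancer", "Disease")]
target = entities_to_json_target(gold)
recovered = parse_json_prediction(target)
```

One practical consequence of this formulation is that generated output can be invalid JSON, so a parser like the one above has to decide how to handle it. Here malformed output is treated as "no entities predicted", which is one possible choice rather than the paper's stated behavior.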
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Text Readability and Simplification
