Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER
Andrei Baroian

TL;DR
This paper compares BERT-style encoders, GPT-4o with in-context learning, and fine-tuned GPT-4o for clinical Named Entity Recognition on the CADEC corpus. Supervised fine-tuning of GPT-4o performs best (F1 87.1%) but at higher cost, and a simple in-context prompt outperforms a longer, instruction-heavy one.
Contribution
It compares BERT-based models, GPT-4o with in-context learning, and GPT-4o with supervised fine-tuning for clinical NER, highlighting the strengths and limitations of each approach.
Findings
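Supervised fine-tuning of GPT-4o achieves the strongest overall performance (F1 87.1%), at higher cost.
In-context learning with a simple prompt outperforms a longer, instruction-heavy prompt.
RoBERTa-large and BioClinicalBERT offer only limited improvements over BERT Base.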
Abstract
We study clinical Named Entity Recognition (NER) on the CADEC corpus and compare three families of approaches: (i) BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), (ii) GPT-4o used with few-shot in-context learning (ICL) under simple vs. complex prompts, and (iii) GPT-4o with supervised fine-tuning (SFT). All models are evaluated on standard NER metrics over CADEC's five entity types (ADR, Drug, Disease, Symptom, Finding). RoBERTa-large and BioClinicalBERT offer limited improvements over BERT Base, showing the limits of this family of models. Among LLM settings, simple ICL outperforms a longer, instruction-heavy prompt, and SFT achieves the strongest overall performance (F1 87.1%), albeit at higher cost. We also find that the LLM achieves higher accuracy on a simplified task in which classification is restricted to two labels.
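The abstract refers to "standard NER metrics" without spelling them out. As an illustration only, the sketch below computes entity-level precision, recall, and F1 under two assumptions that are not confirmed by the paper: predictions are encoded as BIO tags, and a predicted entity counts as correct only if both its span and its label match the gold annotation exactly. The helper names `bio_to_spans` and `prf1` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert one BIO tag sequence into a set of labeled spans."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
                start, label = None, None
            # Lenient choice (an assumption): a stray "I-X" opens a new span.
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
    return spans


def prf1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1 over sentences."""
    tp = n_gold = n_pred = 0
    for g_tags, p_tags in zip(gold, pred):
        g, p = bio_to_spans(g_tags), bio_to_spans(p_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example using CADEC label names; the tags themselves are made up.
gold = [["B-Drug", "O", "B-ADR", "I-ADR", "O"]]
pred = [["B-Drug", "O", "B-ADR", "O", "O"]]
print(prf1(gold, pred))  # (0.5, 0.5, 0.5): the truncated ADR span is not an exact match
```

Under exact matching, a prediction that captures only part of an entity earns no credit, so scores depend on the matching rule as well as on the model. A relaxed (overlap-based) rule would score the same predictions higher.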
