Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings

Gunjan Balde; Soumyadeep Roy; Mainack Mondal; Niloy Ganguly

arXiv:2505.21242 · cs.CL · April 22, 2026

TL;DR

This study evaluates how vocabulary adaptation improves large language models' performance in medical text summarization, especially in high out-of-vocabulary scenarios, through extensive experiments and human assessments.

Contribution

The paper demonstrates that vocabulary adaptation strategies significantly improve the performance and relevance of LLMs on medical summarization by addressing vocabulary mismatch.

Findings

01

Vocabulary adaptation improves summarization accuracy in high OOV settings.

02

Llama-3.1 still over-fragments medical words despite its vocabulary of around 128K tokens.

03

Human evaluations favor vocabulary-adapted models for relevance and faithfulness.

Abstract

Large Language Models (LLMs) have recently achieved great success in medical text summarization simply by using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail; they typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs suffer a significant performance drop on data points with a high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue, in which the LLM vocabulary is updated with certain expert-domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces an over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the…
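To make the over-fragmentation and OOV ideas in the abstract concrete, here is a minimal Python sketch. It is not the paper's method or metric definition: the greedy longest-match splitter, the toy vocabularies, and the names `greedy_subword_split`, `fragment_score`, and `oov_concentration` are all illustrative assumptions. It only shows how adding domain words to a vocabulary lowers the average number of subword pieces per word.

```python
# Illustrative sketch only: a toy greedy splitter stands in for a real
# LLM tokenizer, and both vocabularies below are made up for the example.

def greedy_subword_split(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right.

    Any character not covered by a vocabulary piece becomes its own piece.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # single-character fallback
            i += 1
    return pieces


def fragment_score(words, vocab):
    """Average number of subword pieces per word (higher = more fragmented)."""
    return sum(len(greedy_subword_split(w, vocab)) for w in words) / len(words)


def oov_concentration(words, vocab):
    """Fraction of words that are not a single vocabulary entry."""
    return sum(1 for w in words if w not in vocab) / len(words)


# Hypothetical base vocabulary and a version "adapted" with two medical words.
TOY_VOCAB = {"the", "patient", "has", "hyper", "tension", "card", "io", "my", "opathy"}
ADAPTED_VOCAB = TOY_VOCAB | {"hypertension", "cardiomyopathy"}
WORDS = ["the", "patient", "has", "hypertension", "cardiomyopathy"]

if __name__ == "__main__":
    for name, vocab in [("base", TOY_VOCAB), ("adapted", ADAPTED_VOCAB)]:
        print(name, fragment_score(WORDS, vocab), oov_concentration(WORDS, vocab))
```

In this toy setting, "cardiomyopathy" splits into four pieces under the base vocabulary and a single piece after adaptation. A real study would measure the same quantities with the model's actual tokenizer rather than this stand-in.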

Code & Models

Repositories

gb-kgp/LLM-MedicalSummarization-Benchmark (GitHub)

Videos

Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings (Underline)