Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

Gunjan Balde; Soumyadeep Roy; Mainack Mondal; Niloy Ganguly

arXiv:2605.17379 · cs.CL · May 19, 2026

TL;DR

This paper presents a parameter-efficient vocabulary adaptation method for large language models to improve specialized text summarization, reducing training time and parameter growth while enhancing summary quality.

Contribution

The paper introduces a unified framework that combines vocabulary adaptation with pretraining to address tokenization inefficiencies in domain-specific summarization tasks.
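To make the tokenization issue concrete, the sketch below measures fragmentation as the average number of tokens per word (often called fertility) before and after a domain term is added to the vocabulary. It is only an illustration: `greedy_tokenize`, `fertility`, and both toy vocabularies are hypothetical, and greedy longest-match splitting is a simple stand-in for the BPE tokenizers that models such as Llama and Qwen actually use.

```python
from typing import List, Set


def greedy_tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Split a word by greedy longest match; unknown characters become single-char pieces."""
    pieces: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate first and shrink until a vocabulary hit,
        # falling back to a single character when nothing matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def fertility(words: List[str], vocab: Set[str]) -> float:
    """Average number of pieces per word; higher means more fragmentation."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


# Toy vocabularies (assumed for illustration, not taken from the paper).
general_vocab = {"the", "hep", "ato", "tox", "icity"}
adapted_vocab = general_vocab | {"hepatotoxicity"}

words = ["the", "hepatotoxicity"]
general_pieces = greedy_tokenize("hepatotoxicity", general_vocab)
general_fertility = fertility(words, general_vocab)
adapted_fertility = fertility(words, adapted_vocab)
```

With the general toy vocabulary the medical term splits into four pieces; once the term is a single vocabulary entry, it costs one token, so the same text yields a shorter sequence.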

Findings

01. Improves semantic similarity between summaries and references.

02. Produces more coherent, relevant, and domain-specific summaries.

03. Reduces training time by 35-55% and parameter count by up to 37%.

Abstract

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks under a challenge-oriented evaluation protocol focused on expert-driven text and summaries, which typically have a higher concentration of over-fragmented…
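The replace-then-expand idea in the abstract can be sketched on a toy vocabulary: new domain tokens first reuse the ids of tokens judged replaceable, and the vocabulary grows only once those slots run out. This is a minimal sketch under stated assumptions, not the authors' implementation. The function `replace_then_expand` and all token names are hypothetical, the set of under-trained or unreachable tokens is assumed to be given, and the embedding initialization and pretraining steps are left out.

```python
from typing import Dict, List, Set


def replace_then_expand(
    vocab: Dict[str, int],
    replaceable: Set[str],
    new_tokens: List[str],
) -> Dict[str, int]:
    """Return a new token-to-id map that reuses replaceable ids before appending new ones."""
    updated = dict(vocab)
    incoming = set(new_tokens)
    # Slots that may be overwritten, lowest id first (assumed to be identified beforehand).
    slots = sorted(
        (updated[t], t) for t in replaceable if t in updated and t not in incoming
    )
    next_id = max(updated.values(), default=-1) + 1
    for tok in new_tokens:
        if tok in updated:
            continue  # already in the vocabulary, nothing to do
        if slots:
            idx, old = slots.pop(0)
            del updated[old]      # retire the replaceable token
            updated[tok] = idx    # the new token inherits its id (no parameter growth)
        else:
            updated[tok] = next_id  # no free slot left: expand the vocabulary
            next_id += 1
    return updated


# Toy data (assumed for illustration only).
vocab = {"the": 0, "<unused_1>": 1, "law": 2, "<unreachable_x>": 3}
replaceable = {"<unused_1>", "<unreachable_x>"}
new_tokens = ["plaintiff", "hepatic", "tort"]

adapted = replace_then_expand(vocab, replaceable, new_tokens)
growth = len(adapted) - len(vocab)
```

In this toy run three domain tokens are added but the vocabulary grows by only one entry, because two of them take over ids that were previously held by replaceable tokens. In a real model each reused id corresponds to an existing embedding row, which is why replacement limits parameter growth.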

Code & Models

Repositories

gb-kgp/VocabReplace-Then-Expand (GitHub)