TL;DR
This paper presents a parameter-efficient vocabulary adaptation method for large language models to improve specialized text summarization, reducing training time and parameter growth while enhancing summary quality.
Contribution
It introduces a unified framework combining vocabulary adaptation with pretraining, specifically addressing tokenization issues in domain-specific summarization tasks.
Findings
Improves semantic similarity between summaries and references.
Produces more coherent, relevant, and domain-specific summaries.
Reduces training time by 35-55% and parameter count by up to 37%.
Abstract
Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks under a challenge-oriented evaluation protocol focused on expert-driven text and summaries, which typically have a higher concentration of over-fragmented…
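The slot-reuse idea in the abstract (adding domain tokens while replacing under-trained or unreachable ones) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `adapt_vocabulary`, the toy vocabulary, and the list of replaceable IDs are all hypothetical, and how the paper detects under-trained tokens or initializes the reused embedding rows is not covered by this summary.

```python
from typing import Dict, List, Tuple


def adapt_vocabulary(
    vocab: Dict[str, int],
    replaceable_ids: List[int],
    new_tokens: List[str],
) -> Tuple[Dict[str, int], int]:
    """Place new tokens into replaceable slots first; append only the overflow.

    Returns the adapted vocabulary and the number of newly appended IDs,
    which corresponds to the number of extra embedding rows required.
    """
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    adapted = dict(vocab)
    free_slots = sorted(replaceable_ids)  # assumed to be identified upstream
    next_id = max(vocab.values()) + 1
    added_rows = 0

    for token in new_tokens:
        if token in adapted:
            continue  # already present, nothing to do
        if free_slots:
            slot = free_slots.pop(0)
            del adapted[id_to_token[slot]]  # drop the replaced token
            adapted[token] = slot           # reuse its ID (no growth)
        else:
            adapted[token] = next_id        # vocabulary grows by one row
            next_id += 1
            added_rows += 1
    return adapted, added_rows


# Toy example: two replaceable slots, three hypothetical legal-domain tokens.
vocab = {"the": 0, "law": 1, "<unused_0>": 2, "<unused_1>": 3}
adapted, added_rows = adapt_vocabulary(vocab, [2, 3], ["plaintiff", "habeas", "tort"])
print(adapted, added_rows)
```

In this toy case two of the three new tokens reuse existing IDs, so the embedding matrix would grow by one row instead of three, which is the mechanism by which replacement limits parameter growth.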
