Gold Panning in Vocabulary: An Adaptive Method for Vocabulary Expansion of Domain-Specific LLMs
Chengyuan Liu, Shihang Wang, Lizhi Qing, Kun Kuang, Yangyang Kang, Changlong Sun, Fei Wu

TL;DR
This paper introduces VEGAD, an adaptive method that selects which words from a domain vocabulary to add when expanding a domain-specific large language model, validated through experiments on three Chinese datasets.
Contribution
It proposes VEGAD, an adaptive approach that automatically identifies the valuable words in a given domain vocabulary, addressing the limitations of expanding an LLM with the entire vocabulary.
Findings
VEGAD improves performance on domain-specific tasks.
Optimal vocabulary subset selection enhances general task performance.
Validated on three Chinese datasets.
Abstract
While Large Language Models (LLMs) demonstrate impressive generation abilities, they frequently struggle when it comes to specialized domains due to their limited domain-specific knowledge. Studies on domain-specific LLMs resort to expanding the vocabulary before fine-tuning on domain-specific corpus, aiming to decrease the sequence length and enhance efficiency during decoding, without thoroughly investigating the results of vocabulary expansion to LLMs over different domains. Our pilot study reveals that expansion with only a subset of the entire vocabulary may lead to superior performance. Guided by the discovery, this paper explores how to identify a vocabulary subset to achieve the optimal results. We introduce VEGAD, an adaptive method that automatically identifies valuable words from a given domain vocabulary. Our method has been validated through experiments on three Chinese…
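The sequence-length motivation mentioned in the abstract can be illustrated with a toy sketch. This is not VEGAD and not any real tokenizer: the `tokenize` function, the vocabularies, and the example word are all hypothetical, and the segmentation is a simple greedy longest-match with a single-character fallback. It only shows why adding a domain word to the vocabulary shortens the token sequence.

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation (toy); unknown spans fall back to single characters."""
    max_len = max((len(word) for word in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical general-purpose vocabulary and a hypothetical domain word.
base_vocab = {"hyper", "tension"}
domain_words = {"hypertension"}

text = "hypertension"
before = tokenize(text, base_vocab)                 # general vocabulary only
after = tokenize(text, base_vocab | domain_words)   # vocabulary after expansion

print(len(before), before)
print(len(after), after)
```

In this toy setting the expanded vocabulary covers the word with fewer tokens. Whether adding a given word also helps model quality is a separate question, which is the subset-selection problem the abstract raises.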
Taxonomy
Topics: Natural Language Processing Techniques · Linguistics and Terminology Studies · Text Readability and Simplification
