IGOT: Information Gain Optimized Tokenizer on Domain Adaptive Pretraining
Dawei Feng, Yihai Zhang, Zhixuan Xu

TL;DR
This paper introduces IGOT, a domain-specific tokenizer optimized through information gain analysis, which enhances domain-adaptive pretraining efficiency and performance of large language models.
Contribution
The paper proposes a novel tokenizer construction method using information gain, improving domain adaptation and training efficiency for large language models.
Findings
11.9% token saving during pretraining
12.2% training time reduction
5.8% GPU VRAM usage decrease
Abstract
Pretrained Large Language Models (LLMs) such as ChatGPT and Claude have demonstrated strong capabilities across many areas of natural language generation. However, many problems remain when applying LLMs to specialized domains. When using generative AI for downstream tasks, a common approach is to add new knowledge (e.g., private domain knowledge, cutting-edge information) to a pretrained model through continued training or fine-tuning. Whether a universal paradigm for domain adaptation training exists is still an open question. In this article, we propose the Information Gain Optimized Tokenizer (IGOT), which analyzes the special token set of downstream tasks, constructs a new subset using a heuristic function of each special token and its information gain to build a new domain-specific tokenizer, and continues pretraining on the downstream…
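The abstract's idea of selecting domain tokens with a heuristic score can be illustrated with a toy sketch. This is not the paper's method: the actual heuristic and information-gain definition are not given in the text above. Here the score is a simple stand-in (how many tokens a word would save if it became a single vocabulary entry), and `base_tokenize`, `score_candidates`, `select_tokens`, and `count_tokens` are hypothetical names introduced only for illustration.

```python
from collections import Counter


def base_tokenize(word, width=3):
    """Hypothetical stand-in for a general-purpose subword tokenizer:
    splits a word into fixed-width character chunks."""
    return [word[i:i + width] for i in range(0, len(word), width)]


def score_candidates(corpus_words, width=3):
    """Assumed score (not the paper's): tokens saved across the corpus
    if the whole word were added to the vocabulary as one token."""
    freq = Counter(corpus_words)
    return {w: n * (len(base_tokenize(w, width)) - 1) for w, n in freq.items()}


def select_tokens(corpus_words, k, width=3):
    """Keep the k highest-scoring words that save at least one token."""
    scores = score_candidates(corpus_words, width)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, s in ranked[:k] if s > 0]


def count_tokens(corpus_words, added, width=3):
    """Token count when words in `added` are emitted as single tokens."""
    added = set(added)
    return sum(1 if w in added else len(base_tokenize(w, width))
               for w in corpus_words)


corpus = ("the tokenizer splits tokenizer output into subword units "
          "for the tokenizer").split()
new_tokens = select_tokens(corpus, k=2)
before = count_tokens(corpus, [])
after = count_tokens(corpus, new_tokens)
saving = (before - after) / before  # relative token saving on this toy corpus
print(new_tokens, before, after, round(saving, 3))
```

A real pipeline would replace `base_tokenize` with the pretrained model's tokenizer and the score with the paper's heuristic; the toy numbers printed here say nothing about the savings reported under Findings.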
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Video Analysis and Summarization · Web Data Mining and Analysis
Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Byte Pair Encoding · Multi-Head Attention · Layer Normalization · Attention Dropout · Dense Connections · Adafactor · Gated Linear Unit
