IGOT: Information Gain Optimized Tokenizer on Domain Adaptive Pretraining

Dawei Feng; Yihai Zhang; Zhixuan Xu

arXiv:2405.09857·cs.CL·May 17, 2024

Open Access

TL;DR

This paper introduces IGOT, a domain-specific tokenizer optimized through information gain analysis, which enhances domain-adaptive pretraining efficiency and performance of large language models.

Contribution

The paper proposes a novel tokenizer construction method using information gain, improving domain adaptation and training efficiency for large language models.

Findings

01. 11.9% token saving during pretraining
02. 12.2% training time reduction
03. 5.8% GPU VRAM usage decrease

Abstract

Pretrained large language models (LLMs) such as ChatGPT and Claude have demonstrated strong capabilities across many areas of natural language generation. However, many problems remain when applying LLMs to specialized domains. When using generative AI for downstream tasks, a common approach is to add new knowledge (e.g., private domain knowledge, cutting-edge information) to a pretrained model through continued training or fine-tuning. Whether a universal paradigm for domain adaptation training exists is still an open question. In this article, we propose the Information Gain Optimized Tokenizer (IGOT), which analyzes the special token set of downstream tasks, constructs a new subset using a heuristic function $ϕ$ with the special token and its information gain to build a new domain-specific tokenizer, and continues pretraining on the downstream…
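The selection step described in the abstract can be illustrated with a toy sketch. The abstract is truncated here and does not define the information gain or the heuristic $ϕ$, so everything below is an assumption made for illustration: the gain is approximated as the number of tokens saved across a corpus if a word became a single token, `phi` is a hypothetical top-k filter over that gain, and `base_tokenize` is a stand-in greedy longest-match splitter rather than a real BPE tokenizer. It is not the authors' method.

```python
from collections import Counter


def base_tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces from `vocab`.

    Single characters are the fallback, so the loop always makes progress.
    This is a toy stand-in for a real subword tokenizer.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def token_gain(word, freq, vocab):
    """Assumed gain proxy: tokens saved over the corpus if `word` were one token."""
    return freq * (len(base_tokenize(word, vocab)) - 1)


def phi(corpus_words, vocab, top_k=2, min_gain=1):
    """Hypothetical selection heuristic: keep the top_k words with gain >= min_gain."""
    freqs = Counter(corpus_words)
    gains = {w: token_gain(w, f, vocab) for w, f in freqs.items() if w not in vocab}
    # Sort by descending gain, breaking ties alphabetically for determinism.
    ranked = sorted(gains.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, g in ranked[:top_k] if g >= min_gain]


def count_tokens(corpus_words, vocab):
    """Total number of tokens the corpus costs under `vocab`."""
    return sum(len(base_tokenize(w, vocab)) for w in corpus_words)


corpus = "the tokenizer splits tokenizer output into subword units for the tokenizer".split()
vocab = {"the", "token", "izer", "sub", "word", "into", "for",
         "out", "put", "units", "split", "s"}

new_tokens = phi(corpus, vocab)
before = count_tokens(corpus, vocab)
after = count_tokens(corpus, vocab | set(new_tokens))
print(new_tokens, before, after)
```

In this toy corpus the most frequent multi-piece word is selected first, and the token count for the corpus drops once the selected words are added to the vocabulary. A real pipeline would add the selected tokens to the model's tokenizer and resize the embedding matrix before continuing pretraining; those steps are omitted here.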

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Video Analysis and Summarization · Web Data Mining and Analysis

Methods: Attention Is All You Need · Sparse Evolutionary Training · Linear Layer · Byte Pair Encoding · Multi-Head Attention · Layer Normalization · Attention Dropout · Dense Connections · Adafactor · Gated Linear Unit