TL;DR
FLEXITOKENS introduces adaptive byte-level tokenization with learnable boundaries, improving language model flexibility and performance across diverse languages and domains by reducing overfragmentation.
Contribution
The paper proposes FLEXITOKENS, a training objective for byte-level language models that makes tokenization adaptive. Models trained with it outperform fixed subword tokenizers on a range of multilingual and domain-specific tasks.
Findings
FLEXITOKENS reduces token overfragmentation across multiple benchmarks.
Achieves improvements of up to 10% on token classification and generative tasks.
Demonstrates consistent benefits across model sizes and domains.
Abstract
Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing overfragmentation of text in out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries given the input byte sequence, encoding it into variable-length segments. Most tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation.…
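To make the boundary-prediction idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the 0.5 threshold, and the squared-deviation penalty are all assumptions chosen for illustration. It shows how per-byte boundary probabilities could split a byte sequence into variable-length segments, and what a "fixed compression rate" auxiliary penalty of the kind the abstract criticizes might look like in its simplest form.

```python
def segment_bytes(text, boundary_probs, threshold=0.5):
    """Split the UTF-8 bytes of `text` into segments.

    A segment ends at byte i whenever boundary_probs[i] >= threshold.
    (Hypothetical helper; the thresholding rule is an assumption.)
    """
    data = text.encode("utf-8")
    if len(boundary_probs) != len(data):
        raise ValueError("need exactly one boundary probability per byte")
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p >= threshold:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        # Trailing bytes with no predicted boundary form a final segment.
        segments.append(data[start:])
    return segments


def compression_rate(segments):
    """Average number of bytes per segment."""
    if not segments:
        raise ValueError("no segments")
    return sum(len(s) for s in segments) / len(segments)


def fixed_rate_penalty(boundary_probs, target_rate):
    """Toy stand-in for a fixed-compression-rate auxiliary loss.

    Penalizes any deviation of the expected boundary frequency from
    1 / target_rate, in both directions. The squared form is an
    assumption, not the loss used by any specific method.
    """
    expected_freq = sum(boundary_probs) / len(boundary_probs)
    return (expected_freq - 1.0 / target_rate) ** 2


probs = [0.1, 0.9, 0.9, 0.2, 0.8]
segments = segment_bytes("ab cd", probs)
print(segments)
print(compression_rate(segments))
print(fixed_rate_penalty(probs, target_rate=2.0))
```

Because a penalty like this pulls every input toward the same boundary frequency, it cannot easily let one language or domain compress more than another, which is the rigidity the abstract attributes to fixed-rate objectives.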
