GQ-VAE: A gated quantized VAE for learning variable length tokens
Theo Datta, Kayla Huang, Sham Kakade, David Brandfonbrener

TL;DR
GQ-VAE is a gated quantized VAE that learns variable-length discrete tokens. It improves compression and language modeling performance over a standard VQ-VAE tokenizer and works as a drop-in replacement for existing tokenizers.
Contribution
The paper presents GQ-VAE, an architecture that learns to encode variable-length discrete tokens. Because it is pre-trained independently, it can replace an existing tokenizer without changes to the language model architecture.
Findings
GQ-VAE outperforms standard VQ-VAE in compression and language modeling.
GQ-VAE approaches BPE's compression rate and language modeling performance.
Compared with BPE at a smaller vocabulary size, GQ-VAE improves downstream language model learning.
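The compression comparisons in these findings can be made concrete with a small sketch. This summary does not state the exact metric the paper uses, so the code below assumes a common one: average UTF-8 bytes per token. The function names and the two toy tokenizers are hypothetical stand-ins for illustration and are not GQ-VAE, VQ-VAE, or BPE.

```python
def bytes_per_token(text, tokenize):
    """Average number of UTF-8 bytes per token (assumed metric; higher means more compression)."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    return len(text.encode("utf-8")) / len(tokens)


def byte_tokenize(text):
    # Fixed-length baseline: one token per UTF-8 byte.
    return list(text.encode("utf-8"))


def word_tokenize(text):
    # Toy variable-length tokenizer: one token per whitespace-delimited word.
    return text.split()


sample = "variable length tokens compress text"
byte_rate = bytes_per_token(sample, byte_tokenize)
word_rate = bytes_per_token(sample, word_tokenize)
print(f"byte-level: {byte_rate:.2f} bytes/token")
print(f"word-level: {word_rate:.2f} bytes/token")
```

Under this assumed metric, a tokenizer whose tokens span more bytes on average yields shorter sequences for the same text, which is the sense in which variable-length tokens can improve compression over fixed-length ones.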
Abstract
While most frontier models still use deterministic frequency-based tokenization algorithms such as byte-pair encoding (BPE), there has been significant recent work to design learned neural tokenizers. However, these schemes generally add to underlying language model complexity and force large changes to architecture, making them hard to implement at large scales. To overcome these challenges, we propose the gated quantized variational autoencoder (GQ-VAE), a novel architecture that can be independently pre-trained to serve as a drop-in replacement for existing tokenizers. The key innovation of the architecture is to learn to encode variable-length discrete tokens. GQ-VAE improves compression and language modeling performance over a standard VQ-VAE tokenizer, and approaches the compression rate and language modeling performance of BPE. Interestingly, if we use BPE with a smaller…
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Adversarial Robustness in Machine Learning
