Unlocking Noisy Real-World Corpora for Foundation Model Pre-Training via Quality-Aware Tokenization
Arvid E. Gollwitzer, Paridhi Latawa, David de Gruijl, Deepak A. Subramanian, Adrián Noriega de la Colina

TL;DR
This paper introduces QA-Token, a quality-aware tokenization method that incorporates data reliability into vocabulary construction to improve foundation model pre-training on noisy real-world data, with reported gains in genomics variant calling and financial Sharpe ratio.
Contribution
The paper proposes three components for quality-aware tokenization of noisy data: a bilevel optimization formulation, a reinforcement learning approach that learns merge policies, and an adaptive parameter learning mechanism based on Gumbel-Softmax relaxation.
Findings
6.7 percentage point F1 gain in genomics variant calling over BPE
30% Sharpe ratio increase in finance
State-of-the-art pathogen detection at foundation scale
Abstract
Current tokenization methods process sequential data without accounting for signal quality, limiting their effectiveness on noisy real-world corpora. We present QA-Token (Quality-Aware Tokenization), which incorporates data reliability directly into vocabulary construction. We make three key contributions: (i) a bilevel optimization formulation that jointly optimizes vocabulary construction and downstream performance, (ii) a reinforcement learning approach that learns merge policies through quality-aware rewards with convergence guarantees, and (iii) an adaptive parameter learning mechanism via Gumbel-Softmax relaxation for end-to-end optimization. Our experimental evaluation demonstrates consistent improvements in genomics (6.7 percentage point F1 gain in variant calling over BPE) and finance (30% Sharpe ratio improvement). At foundation scale, we tokenize a pretraining corpus comprising…
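For readers unfamiliar with the Gumbel-Softmax relaxation named in contribution (iii), the following is a minimal sketch of the generic technique, not the paper's implementation. It draws a soft, differentiable-in-principle one-hot sample over a set of discrete choices. The function name `gumbel_softmax`, the temperature `tau`, and the three `merge_logits` values are hypothetical and chosen only for illustration; how QA-Token applies the relaxation to its own parameters is not described in the text above.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Return one relaxed (soft) one-hot sample over len(logits) choices.

    Generic Gumbel-Softmax: softmax((logits + g) / tau), with g ~ Gumbel(0, 1).
    Lower tau pushes the sample closer to a hard one-hot vector.
    """
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate merges (illustrative values only).
merge_logits = [2.0, 0.5, -1.0]
sample = gumbel_softmax(merge_logits, tau=0.5, rng=random.Random(0))
```

The returned list is a probability vector over the candidates. In an autodiff framework the same expression would let gradients flow through the sampling step, which is the usual reason for choosing this relaxation for end-to-end optimization.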
Taxonomy
Topics: Topic Modeling · Genomics and Rare Diseases · Language and cultural evolution
