Conditional Unigram Tokenization with Parallel Data
Gianluca Vico, Jindřich Libovický

TL;DR
This paper proposes a conditional unigram tokenization method that conditions target token probabilities on source tokens using parallel data, aiming to improve cross-lingual tasks.
Contribution
It introduces a novel conditional unigram tokenization approach that leverages parallel data to better align cross-lingual semantics.
Findings
Maintains similar statistical properties to standard unigram tokenizers.
No significant improvement in machine translation quality.
Consistent perplexity reductions observed in language modeling.
Abstract
We introduce conditional unigram tokenization, a novel approach that extends unigram tokenization by conditioning target token probabilities on source-language tokens from parallel data. Given a fixed source tokenizer, our method learns a target tokenizer that maximizes cross-lingual semantic alignment. We evaluate our tokenizer on four language pairs across different families and resource levels, examining intrinsic properties and downstream performance on machine translation and language modeling. While our conditional tokenizer maintains comparable statistical properties to standard unigram tokenizers, results are mixed: we observe no improvements in machine translation quality, but find consistent perplexity reductions in language modeling. We hypothesize that quadratic scaling of conditional probability estimation with respect to the vocabulary size creates a data efficiency…
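To make the idea in the abstract concrete, here is a minimal Python sketch of how target token probabilities might be conditioned on source tokens. It is an illustration under stated assumptions, not the authors' implementation: the co-occurrence estimator, the additive smoothing, the averaging over source tokens, and the names `estimate_conditional`, `token_logprob`, and `segment` are all hypothetical choices made for this example.

```python
import math
from collections import defaultdict


def estimate_conditional(pairs, smoothing=1e-3):
    """Estimate p(target_token | source_token) from tokenized sentence pairs.

    ASSUMPTION: plain sentence-level co-occurrence counts with additive
    smoothing; the paper's actual estimator may differ.
    """
    cooc = defaultdict(float)       # (source token, target token) -> count
    src_total = defaultdict(float)  # source token -> total co-occurrences
    tgt_vocab = set()
    for src, tgt in pairs:
        for t in tgt:
            tgt_vocab.add(t)
            for s in src:
                cooc[(s, t)] += 1.0
                src_total[s] += 1.0
    v = len(tgt_vocab)

    def prob(t, s):
        # Smoothed estimate; sums to 1 over tgt_vocab for any fixed s.
        return (cooc.get((s, t), 0.0) + smoothing) / (
            src_total.get(s, 0.0) + smoothing * v
        )

    return prob, tgt_vocab


def token_logprob(t, src, prob):
    """ASSUMPTION: score a target token by averaging p(t | s) over a
    non-empty list of source tokens."""
    return math.log(sum(prob(t, s) for s in src) / len(src))


def segment(text, src, vocab, prob, max_len=8):
    """Viterbi search for the best segmentation of `text` given `src`."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best log-probability ending at each position
    back = [0] * (n + 1)          # start index of the last token on the best path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in vocab and best[start] > -math.inf:
                score = best[start] + token_logprob(piece, src, prob)
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]


# Toy parallel data: source "x" co-occurs with "ab", source "y" with "a" + "b".
toy_pairs = [(["x"], ["ab"]), (["x"], ["ab"]), (["y"], ["a", "b"])]
toy_prob, toy_vocab = estimate_conditional(toy_pairs)
seg_given_x = segment("ab", ["x"], toy_vocab, toy_prob)
seg_given_y = segment("ab", ["y"], toy_vocab, toy_prob)
```

In this toy setup the same target string is segmented differently depending on the source sentence, which is the behavior that conditioning is meant to enable. The sketch also hints at the scaling concern raised in the abstract: the table behind `prob` can hold up to one entry per (source token, target token) pair, so the number of parameters to estimate grows with the product of the two vocabulary sizes.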
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
