Training a Tokenizer for Free with Private Federated Learning
Eugene Bagdasaryan, Congzheng Song, Rogier van Dalen, Matt Seigel, and Áine Cahill

TL;DR
This paper introduces a method to train a tokenizer within private federated learning without extra privacy costs, achieving near-oracle performance by sampling model-generated sequences for tokenizer training.
Contribution
The work presents a novel approach to train tokenizers during federated learning without additional privacy budget, using model sampling to improve performance.
Findings
A tokenizer trained on mismatched data increases model perplexity by 20% relative to an oracle tokenizer.
Sub-word tokenizers outperform word-level ones in federated settings.
The proposed method achieves within 1% of oracle tokenizer performance.
Abstract
Federated learning with differential privacy, i.e. private federated learning (PFL), makes it possible to train models on private data distributed across users' devices without harming privacy. PFL is efficient for models, such as neural networks, that have a fixed number of parameters, and thus a fixed-dimensional gradient vector. Such models include neural-net language models, but not tokenizers, the topic of this work. Training a tokenizer requires frequencies of words from an unlimited vocabulary, and existing methods for finding an unlimited vocabulary need a separate privacy budget. A workaround is to train the tokenizer on publicly available data. However, in this paper we first show that a tokenizer trained on mismatched data results in worse model performance compared to a privacy-violating "oracle" tokenizer that accesses user data, with perplexity increasing by 20%. We also…
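To illustrate why a fixed-dimensional update matters, here is a minimal sketch of the standard clip-and-add-Gaussian-noise aggregation commonly used in private federated learning. It is not taken from the paper: the function names (`clip_update`, `private_average`) and parameters (`clip_norm`, `noise_multiplier`) are hypothetical, and the sketch assumes each user sends one update vector of the same length. A table of word frequencies over an unlimited vocabulary has no such fixed length, so it cannot be summed and noised coordinate by coordinate in this way.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale one user's update so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def private_average(user_updates, clip_norm, noise_multiplier, rng):
    """Average clipped updates, adding Gaussian noise to each coordinate.

    Illustrative only: a real system would also track the privacy budget.
    """
    dim = len(user_updates[0])
    # The coordinate-wise sum is only defined when every update has the
    # same, fixed dimension.
    if any(len(u) != dim for u in user_updates):
        raise ValueError("all updates must share one fixed dimension")

    total = [0.0] * dim
    for update in user_updates:
        for i, x in enumerate(clip_update(update, clip_norm)):
            total[i] += x

    # Noise scale is proportional to the clipping norm (the sensitivity).
    sigma = noise_multiplier * clip_norm
    n = len(user_updates)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


if __name__ == "__main__":
    rng = random.Random(0)
    updates = [[3.0, 4.0], [0.0, 0.5], [-1.0, 1.0]]
    print(private_average(updates, clip_norm=1.0, noise_multiplier=0.5, rng=rng))
```

Setting `noise_multiplier` to 0 recovers the plain average of the clipped updates, which is a convenient way to check the clipping step on its own.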
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Mobile Crowdsensing and Crowdsourcing · Data Quality and Management
