Private federated discovery of out-of-vocabulary words for Gboard
Ziteng Sun, Peter Kairouz, Haicheng Sun, Adria Gascon, Ananda Theertha Suresh

TL;DR
This paper introduces a privacy-preserving federated algorithm for discovering frequently typed out-of-vocabulary words in Gboard, ensuring strong differential privacy guarantees while improving vocabulary relevance.
Contribution
It presents a novel private federated analytics method for OOV word discovery in Gboard with local and central differential privacy guarantees.
Findings
Achieves central differential privacy with ε=0.315, δ=10^{-10} for OOV discovery in en-US
Effectively identifies common OOV words across users
Ensures user privacy during vocabulary expansion
Abstract
The vocabulary of language models in Gboard, Google's keyboard application, plays a crucial role in improving user experience. One way to improve the vocabulary is to discover frequently typed out-of-vocabulary (OOV) words on user devices. This task requires strong privacy protection due to the sensitive nature of user input data. In this report, we present a private OOV discovery algorithm for Gboard, which builds on recent advances in private federated analytics. The system offers local differential privacy (LDP) guarantees for user-contributed words. With anonymous aggregation, the final released result would satisfy central differential privacy guarantees with ε=0.315, δ=10^{-10} for OOV discovery in en-US (English in United States).
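To make the idea of a local differential privacy guarantee concrete, the following is a minimal sketch of k-ary randomized response, a textbook LDP mechanism. It is not the algorithm from the report, whose mechanism is not described in this abstract. The function names, the small integer domain standing in for candidate words, and the parameter values are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def randomize(value, domain_size, epsilon, rng):
    """Hypothetical on-device step: k-ary randomized response.

    Reports the true value with probability e^eps / (e^eps + k - 1),
    otherwise reports one of the other k - 1 values uniformly at random.
    This satisfies epsilon-local differential privacy.
    """
    p_true = math.exp(epsilon) / (math.exp(epsilon) + domain_size - 1)
    if rng.random() < p_true:
        return value
    # Draw uniformly from the domain with the true value skipped.
    other = rng.randrange(domain_size - 1)
    return other if other < value else other + 1


def estimate_counts(reports, domain_size, epsilon):
    """Hypothetical server step: debias the aggregated noisy reports.

    If n_v users hold value v, then E[observed_v] = n_v * p + (n - n_v) * q,
    so n_v is estimated as (observed_v - n * q) / (p - q).
    """
    n = len(reports)
    e = math.exp(epsilon)
    p = e / (e + domain_size - 1)    # probability of reporting the truth
    q = 1.0 / (e + domain_size - 1)  # probability of any specific other value
    observed = Counter(reports)
    return [(observed.get(v, 0) - n * q) / (p - q) for v in range(domain_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy data: 10 candidate items, most users hold item 3.
    true_values = [3] * 800 + [5] * 200
    reports = [randomize(v, 10, 1.0, rng) for v in true_values]
    print(estimate_counts(reports, 10, 1.0))
```

The server only ever sees the randomized reports, and the debiased estimates become more accurate as the number of contributing users grows. A real deployment over an open-ended word domain would need a different encoding than a fixed integer domain, which this sketch does not attempt.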
Taxonomy
Topics: Natural Language Processing Techniques
