Large-Scale Differentially Private BERT
Rohan Anil, Badih Ghazi, Vineet Gupta, Ravi Kumar, Pasin Manurangsi

TL;DR
This paper demonstrates that large-scale pretraining of BERT-Large with differential privacy is feasible: using mega-batches and an optimized implementation, it reaches 60.5% masked language model accuracy, compared with 70% for non-private BERT models.
Contribution
The authors present a scalable approach to differentially private BERT pretraining that combines mega-batches, an increasing batch size schedule, and an efficient JAX and XLA implementation, improving both the utility and the efficiency of DP-SGD.
Findings
Achieved 60.5% masked language model accuracy at a batch size of 2 million.
Scaling the batch size to millions improves the utility of DP-SGD for BERT.
An efficient implementation reduces the overhead of DP-SGD in large-scale training.
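One way to see why larger batches can help utility is a back-of-the-envelope calculation. In the usual formulation of DP-SGD, each per-example gradient is clipped to an L2 norm of at most C, the clipped gradients are summed, Gaussian noise with standard deviation sigma * C is added to the sum, and the result is divided by the batch size B. The noise on each coordinate of the averaged gradient then has standard deviation sigma * C / B. The sketch below only evaluates that expression. The function name and the values sigma = 1.0 and C = 1.0 are illustrative assumptions, not settings from the paper, and the sketch ignores how the privacy accounting itself changes with batch size.

```python
def noise_std_of_mean_gradient(noise_multiplier, clip_norm, batch_size):
    """Per-coordinate std of the Gaussian noise after averaging over the batch.

    Noise with std noise_multiplier * clip_norm is added to the SUM of clipped
    gradients, so dividing by batch_size shrinks it by the same factor.
    """
    return noise_multiplier * clip_norm / batch_size


# Illustrative values only (not taken from the paper).
for batch_size in (1_000, 32_000, 2_000_000):
    std = noise_std_of_mean_gradient(1.0, 1.0, batch_size)
    print(f"batch size {batch_size:>9}: noise std on mean gradient = {std:.2e}")
```

At a fixed noise multiplier, the noise on the averaged gradient shrinks in proportion to 1 / B, while the clipped gradient signal does not, which is the intuition behind using very large batches.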
Abstract
In this work, we study the large-scale pretraining of BERT-Large with differentially private SGD (DP-SGD). We show that combined with a careful implementation, scaling up the batch size to millions (i.e., mega-batches) improves the utility of the DP-SGD step for BERT; we also enhance its efficiency by using an increasing batch size schedule. Our implementation builds on the recent work of [SVK20], who demonstrated that the overhead of a DP-SGD step is minimized with effective use of JAX [BFH+18, FJL18] primitives in conjunction with the XLA compiler [XLA17]. Our implementation achieves a masked language model accuracy of 60.5% at a batch size of 2M, for ε = 5.36. To put this number in perspective, non-private BERT models achieve an accuracy of 70%.
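For readers unfamiliar with the DP-SGD step mentioned in the abstract, the following is a minimal sketch of one update in plain Python: compute per-example gradients, clip each to a fixed L2 norm, sum, add Gaussian noise, average, and take a gradient step. It is not the authors' implementation, which uses JAX and XLA on BERT-Large. The toy linear model, the squared-error loss, and all names and hyperparameter values here are assumptions chosen for illustration.

```python
import math
import random


def clip_by_l2_norm(grad, clip_norm):
    """Scale grad down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def per_example_grad(w, x, y):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w for one example."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def dp_sgd_step(w, batch, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on a toy linear model (hypothetical helper)."""
    total = [0.0] * len(w)
    for x, y in batch:
        # Clip each example's gradient before it is aggregated.
        clipped = clip_by_l2_norm(per_example_grad(w, x, y), clip_norm)
        total = [t + c for t, c in zip(total, clipped)]
    # Gaussian noise is added to the sum, with std scaled to the clip norm.
    sigma = noise_multiplier * clip_norm
    noisy_mean = [(t + rng.gauss(0.0, sigma)) / len(batch) for t in total]
    return [wi - lr * gi for wi, gi in zip(w, noisy_mean)]


# Toy data and hyperparameters, chosen only for illustration.
rng = random.Random(0)
batch = [([1.0, 2.0], 3.0), ([0.5, -1.0], 0.0), ([2.0, 0.0], 1.0)]
w_new = dp_sgd_step([0.0, 0.0], batch, lr=0.1, clip_norm=1.0,
                    noise_multiplier=1.1, rng=rng)
print(w_new)
```

The per-example loop is the part that makes a naive DP-SGD step expensive. Frameworks such as JAX can vectorize per-example gradient computation, which is the kind of overhead reduction the abstract attributes to [SVK20].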
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Layer Normalization · WordPiece · Dropout · Dense Connections · Linear Warmup With Linear Decay
