Large-Scale Differentially Private BERT

Rohan Anil; Badih Ghazi; Vineet Gupta; Ravi Kumar; Pasin Manurangsi

arXiv:2108.01624 · cs.LG · August 4, 2021 · 1 citation

TL;DR

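This paper demonstrates that large-scale pretraining of BERT-Large with differential privacy is feasible when mega-batches (batch sizes in the millions) are combined with an optimized implementation. The resulting model reaches 60.5% masked language model accuracy at ε = 5.36, compared with roughly 70% for non-private BERT.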

Contribution

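The authors introduce a scalable method for differentially private BERT-Large pretraining with DP-SGD. It uses mega-batches to improve utility, an increasing batch size schedule to improve efficiency, and an implementation built on JAX and the XLA compiler to keep the overhead of each DP-SGD step low.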

Findings

01. Achieved 60.5% masked language model accuracy at a batch size of 2 million, for ε = 5.36.

02. Scaling the batch size to millions (mega-batches) improves the utility of the DP-SGD step for BERT.

03. An efficient implementation using JAX primitives and the XLA compiler reduces the overhead of DP-SGD in large-scale training.
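
One common intuition for the second finding can be sketched numerically. In standard DP-SGD, Gaussian noise with standard deviation `noise_multiplier * clip_norm` is added to the sum of clipped per-example gradients, and the sum is then divided by the batch size, so the noise on the averaged gradient shrinks as the batch grows. The function name and default values below are illustrative assumptions, not taken from the paper, and the sketch holds the noise multiplier fixed rather than modeling how privacy accounting changes with batch size.

```python
def noise_std_on_mean_gradient(batch_size, clip_norm=1.0, noise_multiplier=1.0):
    """Per-coordinate std of the Gaussian noise after averaging over the batch.

    Assumes the standard DP-SGD recipe: noise ~ N(0, (noise_multiplier * clip_norm)^2)
    is added to the SUM of clipped gradients, which is then divided by batch_size.
    """
    return noise_multiplier * clip_norm / batch_size


# Hypothetical batch sizes; the last one matches the 2M mega-batch scale.
for batch_size in (1_024, 32_768, 2_000_000):
    print(batch_size, noise_std_on_mean_gradient(batch_size))
```

Because each clipped gradient has norm at most `clip_norm`, the averaged signal stays on the same scale while the averaged noise falls in proportion to `1 / batch_size`.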

Abstract

In this work, we study the large-scale pretraining of BERT-Large with differentially private SGD (DP-SGD). We show that combined with a careful implementation, scaling up the batch size to millions (i.e., mega-batches) improves the utility of the DP-SGD step for BERT; we also enhance its efficiency by using an increasing batch size schedule. Our implementation builds on the recent work of [SVK20], who demonstrated that the overhead of a DP-SGD step is minimized with effective use of JAX [BFH+18, FJL18] primitives in conjunction with the XLA compiler [XLA17]. Our implementation achieves a masked language model accuracy of 60.5% at a batch size of 2M, for $\epsilon = 5.36$. To put this number in perspective, non-private BERT models achieve an accuracy of $\sim$70%.
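
For readers unfamiliar with the DP-SGD step mentioned in the abstract, the following is a minimal JAX sketch of the generic recipe: compute per-example gradients with `jax.vmap`, clip each to a fixed L2 norm, sum, add Gaussian noise, and average. It is not the authors' implementation. The toy linear-regression loss stands in for the BERT masked language model loss, and `loss_fn`, `dp_sgd_step`, and all hyperparameter values are assumptions chosen for illustration.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    # Toy squared-error loss for ONE example (a stand-in for the MLM loss).
    pred = jnp.dot(x, params)
    return (pred - y) ** 2


def dp_sgd_step(params, xs, ys, key, clip_norm=1.0, noise_multiplier=1.0, lr=0.1):
    # Per-example gradients: vmap over the batch axis of xs and ys, not params.
    per_example_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(params, xs, ys)
    # Clip each example's gradient so its L2 norm is at most clip_norm.
    norms = jnp.linalg.norm(per_example_grads, axis=1)
    scale = jnp.minimum(1.0, clip_norm / (norms + 1e-12))
    clipped = per_example_grads * scale[:, None]
    # Add Gaussian noise to the summed gradient, then average over the batch.
    noise = noise_multiplier * clip_norm * jax.random.normal(key, params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / xs.shape[0]
    # Plain SGD update; the paper's optimizer and schedule are not modeled here.
    return params - lr * noisy_mean


# Compile the whole step with XLA via jit.
dp_sgd_step_jit = jax.jit(dp_sgd_step)

key = jax.random.PRNGKey(0)
params = jnp.zeros(3)
xs = jnp.ones((8, 3))
ys = jnp.ones(8)
new_params = dp_sgd_step_jit(params, xs, ys, key)
```

Wrapping the step in `jax.jit` lets XLA fuse the per-example gradient computation, clipping, and noise addition into one compiled program, which is the general mechanism the abstract credits for low DP-SGD overhead.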

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Layer Normalization · WordPiece · Dropout · Dense Connections · Linear Warmup With Linear Decay