BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

TL;DR
BERT introduces a deep bidirectional transformer-based model pre-trained on unlabeled text, achieving state-of-the-art results across multiple NLP tasks with minimal task-specific modifications.
Contribution
It presents a novel pre-training method for deep bidirectional representations that significantly improves performance on various NLP benchmarks.
Findings
Achieves new state-of-the-art results on eleven NLP tasks.
Pushes GLUE score to 80.5%.
Raises SQuAD v1.1 Test F1 to 93.2 and SQuAD v2.0 Test F1 to 83.1.
Abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1…
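The abstract's claim that fine-tuning needs "just one additional output layer" can be illustrated with a minimal sketch. This is not the authors' code: the class name `OutputLayer`, the toy sizes, the 0.02 initialization scale, and the random stand-in for the encoder output are all assumptions made for illustration. It assumes the pre-trained encoder has already produced one fixed-size vector for the input sequence, and shows that the task-specific part is only a linear map followed by a softmax.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class OutputLayer:
    """Hypothetical task head: a single linear layer followed by softmax."""

    def __init__(self, hidden_size, num_labels, seed=0):
        rng = random.Random(seed)
        # Small random initial weights (the 0.02 scale is an arbitrary choice
        # for this sketch); during fine-tuning they would be trained together
        # with the encoder parameters.
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
            for _ in range(num_labels)
        ]
        self.bias = [0.0] * num_labels

    def __call__(self, sequence_vector):
        # One logit per label: dot product of a weight row with the input, plus bias.
        logits = [
            sum(w * x for w, x in zip(row, sequence_vector)) + b
            for row, b in zip(self.weight, self.bias)
        ]
        return softmax(logits)


# Toy sizes for readability; a real encoder's hidden size is much larger.
hidden_size, num_labels = 8, 3

# Random stand-in for the encoder's sequence-level output vector (assumption:
# a real pipeline would take this from the pre-trained model).
rng = random.Random(1)
sequence_vector = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]

head = OutputLayer(hidden_size, num_labels)
probs = head(sequence_vector)
print(probs)  # one probability per label
```

Under these assumptions, switching tasks means changing only `num_labels` (or the shape of the head), while the encoder architecture stays the same, which is the point the abstract makes about avoiding task-specific architecture modifications.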
Code & Models
- 🤗 google-bert/bert-base-uncased · model · 70.8M dl · ♡ 2603
- 🤗 google-bert/bert-base-chinese · model · 3.4M dl · ♡ 1405
- 🤗 google-bert/bert-base-cased · model · 4.7M dl · ♡ 353
- 🤗 google-bert/bert-base-multilingual-cased · model · 3.7M dl · ♡ 581
- 🤗 bhadresh-savani/bert-base-uncased-emotion · model · 4.9k dl · ♡ 56
- 🤗 dslim/bert-base-NER · model · 1.9M dl · ♡ 703
- 🤗 dslim/distilbert-NER · model · 187k dl · ♡ 50
- 🤗 google-bert/bert-base-multilingual-uncased · model · 4.7M dl · ♡ 153
- 🤗 google-bert/bert-large-cased-whole-word-masking-finetuned-squad · model · 38k dl · ♡ 1
- 🤗 google-bert/bert-large-cased-whole-word-masking · model · 560 dl · ♡ 23
Videos
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding· youtube
Taxonomy
Methods: Linear Layer · mBERT · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam
