TL;DR
This paper probes every layer of pre-trained and fine-tuned BERT models for factual knowledge. It finds that intermediate layers hold a significant share of that knowledge and that fine-tuning can cause some of it to be forgotten.
Contribution
It provides a detailed analysis of where factual knowledge resides in BERT's layers and how fine-tuning affects this knowledge, highlighting the importance of intermediate layers and training objectives.
Findings
Intermediate layers contain 17-60% of total knowledge.
Fine-tuning leads to forgetting of relational knowledge.
Ranking models retain more knowledge after fine-tuning.
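One way to read the 17-60% figure in the first finding is as the share of all facts answered correctly by any layer that the final layer alone misses. The sketch below illustrates that reading only. The function name `intermediate_share`, the fact IDs, and the per-layer results are hypothetical and are not taken from the paper's data or code.

```python
from typing import Dict, Set


def intermediate_share(correct_by_layer: Dict[int, Set[str]], final_layer: int) -> float:
    """Fraction of facts answered by at least one layer that the final layer misses.

    Assumption: "total knowledge" is the union of facts any probed layer
    answers correctly; this definition is an illustrative reading, not a
    confirmed reproduction of the paper's metric.
    """
    total = set().union(*correct_by_layer.values())
    if not total:
        return 0.0
    only_intermediate = total - correct_by_layer[final_layer]
    return len(only_intermediate) / len(total)


# Toy, made-up probe results: IDs of facts each layer answered correctly.
toy = {
    10: {"f1", "f2"},
    11: {"f2", "f3"},
    12: {"f2", "f3", "f4"},
}

share = intermediate_share(toy, final_layer=12)
print(f"{share:.2f}")  # prints 0.25: only f1 is missed by layer 12, out of 4 facts
```

Under this reading, a share of 0.25 would mean a quarter of the recoverable facts are visible only when intermediate layers are probed.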
Abstract
Probing complex language models has recently revealed several insights into linguistic and semantic patterns found in the learned representations. In this article, we probe BERT specifically to understand and measure the relational knowledge it captures in its parametric memory. While probing for linguistic understanding is commonly applied to all layers of BERT as well as fine-tuned models, this has not been done for factual knowledge. We utilize existing knowledge base completion tasks (LAMA) to probe every layer of pre-trained as well as fine-tuned BERT models (ranking, question answering, NER). Our findings show that knowledge is not just contained in BERT's final layers. Intermediate layers contribute a significant amount (17-60%) to the total knowledge found. Probing intermediate layers also reveals how different types of knowledge emerge at varying rates. When BERT is fine-tuned,…
Taxonomy
Methods
Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Linear Warmup With Linear Decay · Residual Connection · WordPiece · Attention Dropout · Dense Connections
