Optimizing small BERTs trained for German NER

Jochen Zöllner; Konrad Sperfeld; Christoph Wick; Roger Labahn

arXiv:2104.11559·cs.CL·November 2, 2021

TL;DR

This paper explores methods to optimize small BERT models for German NER, combining techniques from various BERT variants, proposing new fine-tuning methods, and introducing Whole-Word Attention to improve efficiency and performance.

Contribution

The paper introduces two fine-tuning modifications (Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields) and the Whole-Word Attention mechanism, which together improve the efficiency and accuracy of small BERT models on German NER.

Findings

01. Whole-Word Attention reduces memory usage with slight performance gains

02. New fine-tuning methods improve NER accuracy

03. Combining techniques from multiple BERT variants enhances small model performance

Abstract

Currently, the most widespread neural network architecture for training language models is the so-called BERT, which has led to improvements in various Natural Language Processing (NLP) tasks. In general, the larger the number of parameters in a BERT model, the better the results obtained in these NLP tasks. Unfortunately, memory consumption and training duration increase drastically with the size of these models. In this article, we investigate various training techniques for smaller BERT models: we combine different methods from other BERT variants, such as ALBERT, RoBERTa, and relative positional encoding. In addition, we propose two new fine-tuning modifications leading to better performance: Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields. Furthermore, we introduce Whole-Word Attention, which reduces BERT's memory usage and leads to a small…
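For readers unfamiliar with Linear Chain Conditional Random Fields in NER, the sketch below shows plain Viterbi decoding over per-token tag scores and tag-to-tag transition scores. It illustrates the standard technique only, not the modified form proposed in the paper. The function name `viterbi_decode`, the three-tag set, and all scores are hypothetical values chosen for illustration.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring tag sequence for a standard linear-chain CRF.

    emissions[t][k]   : score of tag k at token position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score: List[float] = []
        pointers: List[int] = []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at position t.
            best_prev = max(
                range(num_tags), key=lambda i: score[i] + transitions[i][j]
            )
            new_score.append(
                score[best_prev] + transitions[best_prev][j] + emissions[t][j]
            )
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Follow the backpointers from the best final tag to recover the path.
    path = [max(range(num_tags), key=lambda k: score[k])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical tag set: 0 = O, 1 = B-PER, 2 = I-PER.
# A large negative score makes the transition O -> I-PER effectively forbidden.
transitions = [
    [0.0, 0.0, -10000.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
emissions = [
    [1.0, 0.8, 0.0],  # token 1 slightly prefers O
    [0.0, 0.0, 2.0],  # token 2 strongly prefers I-PER
]
print(viterbi_decode(emissions, transitions))
```

In this toy input, picking the best tag per token independently would give O followed by I-PER, which is an invalid sequence. The transition penalty makes the decoder choose B-PER followed by I-PER instead, which is the kind of sequence-level consistency a CRF layer adds on top of per-token scores.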

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Dropout · Dense Connections · LAMB · Softmax · Attention Dropout · Linear Warmup With Linear Decay · WordPiece