TL;DR
This paper explores methods to optimize small BERT models for German NER, combining techniques from various BERT variants, proposing new fine-tuning methods, and introducing Whole-Word Attention to improve efficiency and performance.
Contribution
It introduces novel fine-tuning modifications and the Whole-Word Attention mechanism, enhancing small BERT models for German NER tasks with improved efficiency and accuracy.
Findings
Whole-Word Attention reduces memory usage with slight performance gains
New fine-tuning methods improve NER accuracy
Combining techniques from multiple BERT variants enhances small model performance
Abstract
Currently, the most widespread neural network architecture for training language models is the so-called BERT, which led to improvements in various Natural Language Processing (NLP) tasks. In general, the larger the number of parameters in a BERT model, the better the results obtained in these NLP tasks. Unfortunately, the memory consumption and the training duration increase drastically with the size of these models. In this article, we investigate various training techniques for smaller BERT models: We combine different methods from other BERT variants like ALBERT, RoBERTa, and relative positional encoding. In addition, we propose two new fine-tuning modifications leading to better performance: Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields. Furthermore, we introduce Whole-Word Attention, which reduces BERT's memory usage and leads to a small…
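The abstract is cut off before it explains how Whole-Word Attention works, so the following is only a rough illustration of why attending over whole words instead of subword tokens could reduce memory, not the authors' implementation. It assumes the common WordPiece convention in which a `##` prefix marks a continuation piece, and it uses the fact that self-attention stores one score per pair of sequence positions. The example tokenization and the helper names `group_wordpieces` and `attention_entries` are hypothetical.

```python
def group_wordpieces(tokens):
    """Group WordPiece tokens into whole words.

    Assumption: a leading '##' marks a piece that continues the previous word.
    """
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1].append(tok)   # continuation of the current word
        else:
            words.append([tok])     # start of a new word
    return words


def attention_entries(seq_len):
    """Number of attention scores per head for one sequence (seq_len x seq_len)."""
    return seq_len * seq_len


# Hypothetical tokenization of a German sentence, for illustration only.
tokens = ["Die", "Bundes", "##kanzler", "##in", "besucht", "Ham", "##burg"]
words = group_wordpieces(tokens)

token_level = attention_entries(len(tokens))  # attention over subword tokens
word_level = attention_entries(len(words))    # attention over whole words

print(f"{len(tokens)} tokens -> {token_level} attention scores per head")
print(f"{len(words)} words  -> {word_level} attention scores per head")
```

Because the score matrix grows quadratically with sequence length, shortening the sequence from tokens to words shrinks it more than proportionally; how the paper actually forms word-level representations is not stated in the text above.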
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Dropout · Dense Connections · LAMB · Softmax · Attention Dropout · Linear Warmup With Linear Decay · WordPiece
