Can BERT Dig It? -- Named Entity Recognition for Information Retrieval in the Archaeology Domain
Alex Brandsen, Suzan Verberne, Karsten Lambers, Milco Wansleeben

TL;DR
This paper introduces ArcheoBERTje, a domain-specific BERT model for archaeological NER, demonstrating significant performance improvements over generic models and showing that domain-specific pre-training reduces the need for additional domain knowledge integration.
Contribution
The paper presents ArcheoBERTje, a BERT model pre-trained on Dutch archaeological texts, and compares its NER performance to generic models, highlighting the benefits of domain-specific pre-training.
Findings
ArcheoBERTje outperforms generic models with an F1 score of 0.735.
Domain-specific pre-training significantly improves NER quality.
Adding domain knowledge from a thesaurus does not enhance model performance.
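For readers unfamiliar with how an NER F1 score such as the 0.735 above is typically derived, here is a minimal sketch of entity-level precision, recall, and F1 over BIO-tagged sequences. It assumes exact span matching and uses hypothetical labels (`LOC`, `PER`, `ART`) and hypothetical helper names (`bio_to_spans`, `entity_f1`); this summary does not state the paper's exact evaluation protocol, which may differ.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); label names are hypothetical
Span = Tuple[str, int, int]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    A stray I- tag with no matching open entity is leniently treated
    as the start of a new entity (an assumption of this sketch).
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall, and F1."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one of two gold entities is recovered, one prediction is spurious.
gold = ["B-LOC", "I-LOC", "O", "B-PER", "O"]
pred = ["B-LOC", "I-LOC", "O", "O", "B-ART"]
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, f1)
```

Under exact matching, a prediction counts only if both its boundaries and its type agree with the gold annotation, so partially correct spans score zero.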
Abstract
The amount of archaeological literature is growing rapidly. Until recently, these data were only accessible through metadata search. We implemented a text retrieval engine for a large archaeological text collection ( Million words). In archaeological IR, domain-specific entities such as locations, time periods, and artefacts play a central role. This motivated the development of a named entity recognition (NER) model to annotate the full collection with archaeological named entities. In this paper, we present ArcheoBERTje, a BERT model pre-trained on Dutch archaeological texts. We compare the model's quality and output on an NER task to a generic multilingual model and a generic Dutch model. We also investigate ensemble methods for combining multiple BERT models, and combining the best BERT model with a domain thesaurus using Conditional Random Fields…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Data Quality and Management
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Linear Warmup With Linear Decay · Residual Connection · WordPiece · Attention Dropout · Dense Connections
