Dealing with Abbreviations in the Slovenian Biographical Lexicon
Angel Daza, Antske Fokkens, Tomaž Erjavec

TL;DR
This paper introduces a new method for identifying and expanding abbreviations in Slovenian biographical texts, targeting the tokenization and out-of-vocabulary errors that abbreviations cause in low-resource, abbreviation-rich settings.
Contribution
It presents a new abbreviation identification and expansion approach tailored to Slovenian biographical texts; the identification step outperforms commonly used ad-hoc solutions, especially on unseen abbreviations.
Findings
Abbreviation identification significantly better than commonly used ad-hoc solutions, especially on unseen abbreviations
Effective expansion of abbreviations in context
Evaluation on a newly developed gold-standard dataset of 51 Slovenian biographies
Abstract
Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make the text less readable, especially in printed reference books, where they are used extensively. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text. We apply this method to the case of a Slovenian biographical lexicon and evaluate it on a newly developed gold-standard dataset of 51 Slovenian biographies. Our abbreviation identification method performs significantly better than commonly used ad-hoc solutions, especially at identifying unseen abbreviations. We also propose and present the results of a method for expanding the identified abbreviations in…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies
