Dealing with Abbreviations in the Slovenian Biographical Lexicon

Angel Daza; Antske Fokkens; Tomaž Erjavec

arXiv:2211.02429·cs.CL·November 7, 2022

TL;DR

This paper proposes a method for identifying and expanding abbreviations in Slovenian biographical texts, targeting the tokenization and out-of-vocabulary errors that abbreviations cause in low-resource, abbreviation-rich settings.

Contribution

It presents an abbreviation identification and expansion approach applied to a Slovenian biographical lexicon. The identification method outperforms commonly used ad-hoc solutions, especially on unseen abbreviations.

Findings

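01 Abbreviation identification performs significantly better than commonly used ad-hoc solutions, especially on unseen abbreviations

02 A method for expanding the identified abbreviations in context is proposed, with results reported

03 Evaluation uses a newly developed gold-standard dataset of 51 Slovenian biographies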

Abstract

Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make the text less readable, especially in reference printed books, where they are extensively used. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text. We apply this method to the case of a Slovenian biographical lexicon and evaluate it on a newly developed gold-standard dataset of 51 Slovenian biographies. Our abbreviation identification method performs significantly better than commonly used ad-hoc solutions, especially at identifying unseen abbreviations. We also propose and present the results of a method for expanding the identified abbreviations in…
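To make the comparison point concrete, the sketch below shows one kind of ad-hoc heuristic that is often used for abbreviation identification: a lookup in a small lexicon combined with a pattern for short tokens ending in a period. It is an illustrative assumption, not the authors' method and not the baseline evaluated in the paper. The function name, the lexicon entries, and the example sentences are all hypothetical.

```python
import re

# Hypothetical seed lexicon of known abbreviations (illustrative only,
# not taken from the paper or its dataset).
KNOWN_ABBREVIATIONS = {"dr.", "prof.", "roj."}

# Heuristic pattern: one to four letters (any script) followed by a period.
ABBREV_PATTERN = re.compile(r"^[^\W\d_]{1,4}\.$")


def find_abbreviation_candidates(text):
    """Return whitespace-separated tokens that look like abbreviations."""
    tokens = text.split()
    candidates = []
    for i, token in enumerate(tokens):
        # Drop surrounding punctuation but keep the trailing period.
        stripped = token.strip(",;:()")
        is_last = i == len(tokens) - 1
        if stripped.lower() in KNOWN_ABBREVIATIONS:
            candidates.append(stripped)
        elif ABBREV_PATTERN.match(stripped) and not is_last:
            # Skip the final token, where the period usually ends the sentence.
            candidates.append(stripped)
    return candidates


print(find_abbreviation_candidates(
    "Dr. Novak, roj. 1850 v Ljubljani, je bil prof. in pisatelj."
))
print(find_abbreviation_candidates("To je res. Potem je odšel."))
```

The second call flags the ordinary sentence-final word "res." as an abbreviation, which illustrates why heuristics of this kind can be brittle on text they were not tuned for.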

Code & Models

Repositories

angel-daza/abbreviation-detector
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Biomedical Text Mining and Ontologies