MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction

Amir Pouran Ben Veyseh; Nicole Meister; Seunghyun Yoon; Rajiv Jain; Franck Dernoncourt; Thien Huu Nguyen

arXiv:2202.09694·cs.CL·February 22, 2022·1 cites

TL;DR

This paper introduces MACRONYM, a large multilingual, multi-domain dataset for acronym extraction, addressing the lack of annotated data across languages and domains, and highlighting unique challenges in this task.

Contribution

The creation of MACRONYM, a comprehensive dataset with 27,200 sentences across six languages and two domains, enabling research on multilingual and multi-domain acronym extraction.

Findings

01

Acronym extraction (AE) varies significantly across languages and domains.

02

Existing models face challenges adapting to multilingual and multi-domain AE.

03

Further research is needed to develop robust multilingual AE methods.

Abstract

Acronym extraction (AE) is the task of identifying acronyms and their expanded forms in text, and it is necessary for various NLP applications. Despite major progress on this task in recent years, existing AE research is limited to the English language and to certain domains (i.e., scientific and biomedical). As such, the challenges of AE in other languages and domains remain largely unexplored. The lack of annotated datasets in multiple languages and domains has been a major obstacle to research in this area. To address this limitation, we propose a new dataset for multilingual, multi-domain AE. Specifically, 27,200 sentences in 6 typologically different languages and 2 domains, i.e., Legal and Scientific, are manually annotated for AE. Our extensive experiments on the proposed dataset show that AE in different languages and different learning settings has unique…
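
To make the task definition concrete, the following is a minimal sketch of what extracting an acronym and its expanded form from a sentence can look like. It is a naive rule-based heuristic written for illustration only and is not the method or baseline used in the paper. The function name `extract_acronyms` and the "Long Form (SF)" pattern are assumptions made for this example.

```python
import re


def extract_acronyms(sentence):
    """Return (short_form, long_form) pairs matching the 'Long Form (SF)' pattern.

    Hypothetical helper for illustration; not taken from the MACRONYM paper.
    """
    pairs = []
    # Candidate short forms: 2 to 10 uppercase ASCII letters inside parentheses.
    for match in re.finditer(r"\(([A-Z]{2,10})\)", sentence):
        short = match.group(1)
        # Take as many preceding words as the short form has letters.
        words = sentence[:match.start()].split()
        candidate = words[-len(short):]
        # Accept only if the initials of those words spell the short form.
        if len(candidate) == len(short) and all(
            word[0].upper() == letter for word, letter in zip(candidate, short)
        ):
            pairs.append((short, " ".join(candidate)))
    return pairs


print(extract_acronyms(
    "Acronym extraction (AE) is needed for natural language processing (NLP)."
))
```

A heuristic of this kind relies on letter case, ASCII characters, and one-initial-per-word expansions, so it would likely miss acronyms in scripts without letter case or with different abbreviation conventions. That limitation is one reason annotated multilingual data is useful.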

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Autoencoders