GLADIS: A General and Large Acronym Disambiguation Benchmark

Lihu Chen; Gaël Varoquaux; Fabian M. Suchanek

arXiv:2302.01860 · cs.CL · March 15, 2023

TL;DR

GLADIS is a large-scale benchmark for acronym disambiguation that spans the general, scientific, and biomedical domains.

Contribution

The paper presents GLADIS, a new benchmark made up of an acronym dictionary, a pre-training corpus, and three domain-specific datasets, together with AcroBERT, a language model pre-trained on that corpus for general acronym disambiguation.

Findings

01

The GLADIS acronym dictionary contains 1.5M acronyms and 6.4M long forms.

02

AcroBERT, pre-trained on the GLADIS corpus of 160 million sentences, is used to show the challenges and value of the benchmark.

03

The benchmark covers general, scientific, and biomedical domains.

Abstract

Acronym Disambiguation (AD) is crucial for natural language understanding on various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguation benchmarks and tools are limited to specific domains, and the size of prior benchmarks is rather small. To accelerate the research on acronym disambiguation, we construct a new benchmark named GLADIS with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences; (3) three datasets that cover the general, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and values of our new benchmark.
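
To make the task concrete, the following is a minimal sketch of dictionary-based acronym disambiguation: look up the candidate long forms of an acronym, then pick the candidate that best matches the surrounding sentence. It is not the AcroBERT method and does not use the GLADIS data. The dictionary entries, the hint words, and the names `ACRONYM_DICT`, `CONTEXT_HINTS`, `tokenize`, and `disambiguate` are all assumptions made up for illustration, and the word-overlap scorer is only a stand-in for a learned model.

```python
import re

# Toy dictionary: acronym -> candidate long forms.
# Illustrative entries only, not taken from the GLADIS dictionary.
ACRONYM_DICT = {
    "CNN": ["Convolutional Neural Network", "Cable News Network"],
    "AD": ["Acronym Disambiguation", "Alzheimer's Disease"],
}

# Hypothetical hint words per long form, standing in for a learned scorer.
CONTEXT_HINTS = {
    "Convolutional Neural Network": {"image", "layer", "training", "classification"},
    "Cable News Network": {"news", "broadcast", "reporter", "television"},
    "Acronym Disambiguation": {"acronym", "long", "form", "benchmark"},
    "Alzheimer's Disease": {"patient", "dementia", "memory", "clinical"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(acronym, sentence):
    """Return the candidate long form with the largest hint overlap.

    Returns None when the acronym is not in the dictionary. Ties are
    resolved in favor of the candidate listed first.
    """
    candidates = ACRONYM_DICT.get(acronym, [])
    if not candidates:
        return None
    context = tokenize(sentence)
    return max(
        candidates,
        key=lambda long_form: len(CONTEXT_HINTS.get(long_form, set()) & context),
    )


if __name__ == "__main__":
    print(disambiguate("CNN", "We trained a CNN for image classification."))
    print(disambiguate("CNN", "The CNN reporter covered the news live."))
```

The two example sentences show why context is needed: the same acronym resolves to a different long form depending on the words around it. A larger dictionary makes the candidate list longer, which is part of what makes the task harder at scale.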

Code & Models

Repositories

tigerchen52/gladis (PyTorch, official)
