CABACE: Injecting Character Sequence Information and Domain Knowledge for Enhanced Acronym and Long-Form Extraction

Nithish Kannen; Divyanshu Sheth; Abhranil Chandra; Shubhraneel Pal

arXiv:2112.13237 · cs.CL · December 28, 2021

TL;DR

This paper introduces CABACE, a domain-adapted, character-aware transformer framework that significantly improves acronym and long-form extraction, especially in scientific and legal texts, with strong zero-shot multilingual capabilities.

Contribution

The work presents a novel character-aware BERT-based model with augmented loss, pseudo-labeling, and adversarial training for enhanced acronym extraction in specialized domains.

Findings

1. Outperforms baseline models in scientific and legal domains
2. Achieves top scores in multilingual acronym extraction tasks
3. Demonstrates strong zero-shot generalization to non-English languages

Abstract

Acronyms and long-forms are commonly found in research documents, more so in documents from scientific and legal domains. Many acronyms used in such documents are domain-specific and are very rarely found in normal text corpora. Owing to this, transformer-based NLP models often detect OOV (Out of Vocabulary) for acronym tokens, especially for non-English languages, and their performance suffers while linking acronyms to their long forms during extraction. Moreover, pretrained transformer models like BERT are not specialized to handle scientific and legal documents. With these points being the overarching motivation behind this work, we propose a novel framework CABACE: Character-Aware BERT for ACronym Extraction, which takes into account character sequences in text and is adapted to scientific and legal domains by masked language modelling. We further use an objective with an augmented…
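The out-of-vocabulary problem described in the abstract can be illustrated with a toy sketch. The code below is not the CABACE architecture; it is a simplified greedy longest-match subword splitter in the style of WordPiece, run against a tiny hypothetical vocabulary (`TOY_VOCAB`), alongside a hypothetical `char_sequence` helper. It only shows why a character-level view still carries information when a subword tokenizer fragments an acronym or collapses it to an unknown token.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style, simplified)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def char_sequence(word):
    """Character-level view of a token (hypothetical helper for illustration)."""
    return list(word)


# Tiny made-up vocabulary; a real model vocabulary has tens of thousands of entries.
TOY_VOCAB = {"masked", "language", "model", "m", "##l", "##m"}

print(wordpiece("language", TOY_VOCAB))  # common word: kept whole
print(wordpiece("mlm", TOY_VOCAB))       # acronym: split into single-letter pieces
print(wordpiece("xqz", TOY_VOCAB))       # unseen acronym: collapses to [UNK]
print(char_sequence("XQZ"))              # characters remain available regardless
```

In this toy setting the unseen acronym loses all of its surface form at the subword level, while its character sequence is still intact, which is the kind of signal a character-aware model can draw on.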

Code & Models

Repositories

nitkannen/backgprop-aaai-22 (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Biomedical Text Mining and Ontologies · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Layer Normalization · Residual Connection · Dropout · Softmax · Attention Dropout · Dense Connections