CABACE: Injecting Character Sequence Information and Domain Knowledge for Enhanced Acronym and Long-Form Extraction
Nithish Kannen, Divyanshu Sheth, Abhranil Chandra, Shubhraneel Pal

TL;DR
This paper introduces CABACE, a domain-adapted, character-aware transformer framework that significantly improves acronym and long-form extraction, especially in scientific and legal texts, with strong zero-shot multilingual capabilities.
Contribution
The work presents a novel character-aware BERT-based model with augmented loss, pseudo-labeling, and adversarial training for enhanced acronym extraction in specialized domains.
Findings
Outperforms baseline models in scientific and legal domains
Achieves top scores in multilingual acronym extraction tasks
Demonstrates strong zero-shot generalization to non-English languages
Abstract
Acronyms and long-forms are common in research documents, particularly in the scientific and legal domains. Many acronyms in such documents are domain-specific and rarely appear in general text corpora. As a result, transformer-based NLP models often treat acronym tokens as OOV (out-of-vocabulary), especially in non-English languages, and their performance suffers when linking acronyms to their long forms during extraction. Moreover, pretrained transformer models like BERT are not specialized for scientific and legal documents. Motivated by these points, we propose a novel framework, CABACE: Character-Aware BERT for ACronym Extraction, which takes character sequences in text into account and is adapted to the scientific and legal domains by masked language modelling. We further use an objective with an augmented…
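The OOV motivation in the abstract can be made concrete with a small sketch. This is not the authors' implementation: it is a toy greedy longest-match subword splitter in the style of WordPiece, run over a hypothetical six-entry vocabulary (TOY_VOCAB), alongside a hypothetical char_sequence helper. It only shows why a rare acronym can collapse to an unknown token or to arbitrary fragments, while its character sequence is always available.

```python
# Toy illustration only. TOY_VOCAB is a hypothetical vocabulary, not BERT's
# real one, and both helper functions are assumptions made for this sketch.
TOY_VOCAB = {"the", "court", "cited", "na", "##to", "##s"}


def wordpiece(word, vocab):
    """Greedy longest-match-first subword split, in the style of WordPiece."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece covers this position: the whole word becomes unknown.
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces


def char_sequence(word):
    """Character-level view of a token, which never falls out of vocabulary."""
    return list(word)


for token in ["court", "nato", "icj"]:
    print(token, wordpiece(token, TOY_VOCAB), char_sequence(token))
```

With this toy vocabulary, "court" stays whole, "nato" is split into fragments, and "icj" maps to the unknown token, yet all three keep a usable character sequence. A character-aware model can draw on that extra signal; how CABACE actually combines it with BERT is described in the paper, not here.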
Taxonomy
Topics: Natural Language Processing Techniques · Biomedical Text Mining and Ontologies · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Layer Normalization · Residual Connection · Dropout · Softmax · Attention Dropout · Dense Connections
