Chemical Identification and Indexing in PubMed Articles via BERT and Text-to-Text Approaches
Virginia Adams, Hoo-Chang Shin, Carol Anderson, Bo Liu, Anas Abidin

TL;DR
This paper addresses chemical named entity recognition, entity linking, and indexing in PubMed articles using BERT-based models and a prompt-based text-to-text generative approach. Named entity recognition performs well, while chemical indexing proves more challenging.
Contribution
It introduces a novel prompt-based method with T5 and GPT for chemical entity recognition and linking, extending BERT-based techniques with a new text-to-text approach.
Findings
BERT-based BioMegatron models gave the authors' best NER performance.
Self-alignment pretraining improved entity linking.
Prompt-based generative models showed promising results.
Abstract
The BioCreative VII Track-2 challenge consists of named entity recognition, entity linking (or entity normalization), and topic indexing tasks -- with entities and topics limited to chemicals for this challenge. Named entity recognition is a well-established problem, and we achieve our best performance with BERT-based BioMegatron models. We extend our BERT-based approach to the entity linking task. After a second stage of pretraining BioBERT with a metric-learning loss strategy called self-alignment pretraining (SAP), we link entities based on the cosine similarity between their SAP-BioBERT word embeddings. Despite the success of our named entity recognition experiments, we find the chemical indexing task generally more challenging. In addition to conventional NER methods, we attempt both named entity recognition and entity linking with a novel text-to-text or "prompt" based method…
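The linking step described in the abstract (choosing the concept whose embedding has the highest cosine similarity to the mention embedding) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the concept identifiers, and the 3-dimensional toy vectors are all hypothetical placeholders standing in for SAP-BioBERT embeddings and a real chemical vocabulary.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def link_entity(mention_vec, concept_vecs):
    """Return (concept_id, score) for the concept most similar to the mention."""
    best_id, best_score = None, float("-inf")
    for concept_id, vec in concept_vecs.items():
        score = cosine_similarity(mention_vec, vec)
        if score > best_score:
            best_id, best_score = concept_id, score
    return best_id, best_score


# Hypothetical toy embeddings; real ones would come from an encoder
# such as SAP-BioBERT and have hundreds of dimensions.
concept_vecs = {
    "CONCEPT_A": [1.0, 0.0, 0.0],
    "CONCEPT_B": [0.0, 1.0, 0.0],
    "CONCEPT_C": [0.6, 0.8, 0.0],
}
mention_vec = [0.1, 0.9, 0.0]

best_id, best_score = link_entity(mention_vec, concept_vecs)
print(best_id, round(best_score, 4))
```

With a large vocabulary, the linear scan here would typically be replaced by a batched matrix product or an approximate nearest-neighbor index; the scoring rule itself stays the same.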
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Advanced Text Analysis Techniques
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Linear Layer · Inverse Square Root Schedule · Softmax · SentencePiece · Residual Connection · Adam · Dropout
