Named entity recognition in chemical patents using ensemble of contextual language models
Jenny Copara, Nona Naderi, Julien Knafou, Patrick Ruch, and Douglas Teodoro

TL;DR
This paper presents an ensemble of contextual language models for extracting chemical reaction information from patents. The ensemble achieves high exact and relaxed F1-scores, demonstrating the effectiveness of ensemble methods in chemical text mining.
Contribution
It introduces a new ensemble approach combining transformer models trained on generic and specialized corpora for chemical patent information extraction.
Findings
Achieved an exact F1-score of 92.30%
Achieved a relaxed F1-score of 96.24%
Ensemble models outperform individual models in chemical patent NER
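The difference between the exact and relaxed scores can be illustrated with a minimal sketch. It assumes that an exact match requires identical span boundaries and label, and that a relaxed match only requires overlapping boundaries with the same label. The challenge's official scorer may define these differently, and the `Span` type, function names, and labels below are hypothetical.

```python
from typing import List, Tuple

# Hypothetical span representation: (start, end_exclusive, label).
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if two spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def span_f1(gold: List[Span], pred: List[Span], relaxed: bool = False) -> float:
    """F1 over entity spans, with exact or (assumed) overlap-based matching."""
    gold_set, pred_set = set(gold), set(pred)
    if relaxed:
        # A prediction counts if it overlaps any gold span of the same label,
        # and a gold span counts as found if any prediction overlaps it.
        matched_pred = sum(any(overlaps(p, g) for g in gold_set) for p in pred_set)
        matched_gold = sum(any(overlaps(g, p) for p in pred_set) for g in gold_set)
    else:
        # Exact matching: boundaries and label must be identical.
        matched_pred = matched_gold = len(gold_set & pred_set)
    precision = matched_pred / len(pred_set) if pred_set else 0.0
    recall = matched_gold / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only; the second prediction has a shifted start offset.
gold = [(0, 5, "LABEL_A"), (10, 20, "LABEL_B")]
pred = [(0, 5, "LABEL_A"), (12, 20, "LABEL_B")]
exact_score = span_f1(gold, pred)
relaxed_score = span_f1(gold, pred, relaxed=True)
```

Under these assumptions, a prediction with a slightly wrong boundary is penalized by the exact score but still credited by the relaxed score, which is why the relaxed figure is the higher of the two.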
Abstract
Chemical patent documents describe a broad range of applications holding key reaction and compound information, such as chemical structure, reaction formulas, and molecular properties. These informational entities should be first identified in text passages to be utilized in downstream tasks. Text mining provides means to extract relevant information from chemical patents through information extraction techniques. As part of the Information Extraction task of the Cheminformatics Elsevier Melbourne University challenge, in this work we study the effectiveness of contextualized language models to extract reaction information in chemical patents. We assess transformer architectures trained on generic and specialised corpora to propose a new ensemble model. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%. The…
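A majority ensemble of the kind mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes voting happens per token over BIO tags, that a tag needs a strict majority of models to be kept, and that tokens without a majority fall back to the outside tag `O`. The function name and the tag names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], default: str = "O") -> List[str]:
    """Merge per-token tag sequences from several models by strict majority.

    predictions: one tag sequence per model, all over the same tokens.
    default: tag emitted when no tag is predicted by more than half the models
             (an assumption; the paper's tie handling is not given here).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must tag the same number of tokens")
    threshold = len(predictions) / 2
    merged = []
    for tags in zip(*predictions):
        tag, count = Counter(tags).most_common(1)[0]
        merged.append(tag if count > threshold else default)
    return merged


# Three hypothetical models tagging the same three tokens.
model_a = ["B-X", "I-X", "O"]
model_b = ["B-X", "O", "O"]
model_c = ["B-X", "I-X", "B-Y"]
ensemble_tags = majority_vote([model_a, model_b, model_c])
```

Voting per token can produce sequences that are not valid BIO (for example an `I-` tag without a preceding `B-` tag), so a real system would likely need a repair step or span-level voting; that choice is left open here.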
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Advanced Text Analysis Techniques
