Unsupervised Term Extraction for Highly Technical Domains

Francesco Fusco; Peter Staar; Diego Antognini

arXiv:2210.13118·cs.CL·October 25, 2022·1 cites

Open Access

TL;DR

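This paper presents an unsupervised term extraction method for highly technical domains. It combines morphological signals from sub-word tokenization with similarity metrics computed by general-domain pre-trained sentence encoders, so that the extractor generalizes across domains without costly expert annotations.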

Contribution

Introduces a fully unsupervised annotator and a weakly-supervised setup that enhances term extraction across diverse technical fields without requiring domain-specific annotations.

Findings

01  Improves predictive performance over baseline methods

02  Reduces inference latency on CPUs and GPUs

03  Provides a competitive baseline for annotation-scarce domains

Abstract

Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that are able to generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowledge discovery platform that targets highly technical fields such as pharma, medical, and material science. To be able to generalize across domains, we introduce a fully unsupervised annotator (UA). It extracts terms by combining novel morphological signals from sub-word tokenization with term-to-topic and intra-term similarity metrics, computed using general-domain pre-trained sentence-encoders. The annotator is used to implement a weakly-supervised setup, where transformer-models are fine-tuned…
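The two kinds of signal named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vocabulary, the greedy WordPiece-style splitter, and the function names `wordpiece`, `mean_pieces_per_word`, and `cosine` are all assumptions made for illustration, and the way the paper combines the signals into a final decision is not shown. The idea sketched is that rare technical words tend to split into many sub-word pieces, while similarity between embeddings is typically measured with cosine similarity.

```python
import math

# Hypothetical toy vocabulary (assumption). A real sub-word tokenizer
# would ship with tens of thousands of entries.
TOY_VOCAB = {
    "the", "patient", "received", "a", "dose", "of", "acid",
    "acet", "##yl", "##sal", "##icy", "##lic",
}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no split exists in this vocabulary
        pieces.append(piece)
        start = end
    return pieces


def mean_pieces_per_word(phrase, vocab):
    """Morphological signal (illustrative): average sub-word pieces per word."""
    words = phrase.lower().split()
    return sum(len(wordpiece(w, vocab)) for w in words) / len(words)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    for phrase in ("acetylsalicylic acid", "the patient"):
        print(phrase, mean_pieces_per_word(phrase, TOY_VOCAB))
```

In a real pipeline, the vectors passed to `cosine` would come from a pre-trained sentence encoder: one vector for the candidate term and one for the document topic (term-to-topic similarity), or one vector per word of the candidate (intra-term similarity). Here the encoder is deliberately left out so the sketch stays self-contained.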

Taxonomy

Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques · Topic Modeling