Unsupervised Term Extraction for Highly Technical Domains
Francesco Fusco, Peter Staar, Diego Antognini

TL;DR
This paper presents an unsupervised term extraction method for highly technical domains, combining morphological signals and sentence-encoder metrics to improve generalization and reduce annotation costs.
Contribution
Introduces a fully unsupervised annotator (UA) and uses it to build a weakly-supervised setup in which transformer models are fine-tuned, so that term extraction generalizes across diverse technical fields without domain-specific annotations.
Findings
Improves predictive performance over baseline methods
Reduces inference latency on CPUs and GPUs
Provides a competitive baseline for annotation-scarce domains
Abstract
Term extraction is an information extraction task at the root of knowledge discovery platforms. Developing term extractors that are able to generalize across very diverse and potentially highly technical domains is challenging, as annotations for domains requiring in-depth expertise are scarce and expensive to obtain. In this paper, we describe the term extraction subsystem of a commercial knowledge discovery platform that targets highly technical fields such as pharma, medical, and material science. To be able to generalize across domains, we introduce a fully unsupervised annotator (UA). It extracts terms by combining novel morphological signals from sub-word tokenization with term-to-topic and intra-term similarity metrics, computed using general-domain pre-trained sentence-encoders. The annotator is used to implement a weakly-supervised setup, where transformer-models are fine-tuned…
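To make the three signals named in the abstract concrete, here is a minimal Python sketch of how a morphological score, a term-to-topic similarity, and an intra-term similarity might be combined into one candidate score. It is not the authors' implementation. Everything in it is an assumption made for illustration: the toy vocabulary `TOY_VOCAB`, the greedy longest-match splitter, the character-trigram `embed` function that stands in for a pre-trained sentence encoder, and the unweighted average in `score`.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical sub-word vocabulary; a real system would use a trained tokenizer.
TOY_VOCAB = {"the", "study", "cell", "poly", "ethyl", "ene"}


def subword_pieces(word, vocab=TOY_VOCAB):
    """Greedy longest-match split; characters not covered become single pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces


def fragmentation(term):
    """Average number of sub-word pieces per word (assumed morphological signal)."""
    words = term.lower().split()
    return sum(len(subword_pieces(w)) for w in words) / len(words)


def embed(text):
    """Stand-in for a sentence encoder: a bag of character trigrams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def term_to_topic(term, topic):
    """Similarity between a candidate term and a description of the topic."""
    return cosine(embed(term), embed(topic))


def intra_term(term):
    """Mean pairwise similarity of the words in a term; 1.0 for a single word."""
    words = term.lower().split()
    if len(words) < 2:
        return 1.0
    sims = [cosine(embed(a), embed(b)) for a, b in combinations(words, 2)]
    return sum(sims) / len(sims)


def score(term, topic):
    """Unweighted average of the three signals (the weighting is an assumption)."""
    morph = 1.0 - 1.0 / fragmentation(term)  # 0.0 when every word is one piece
    return (morph + term_to_topic(term, topic) + intra_term(term)) / 3.0
```

With the topic string "polyethylene polymer synthesis", `score("polyethylene", ...)` comes out higher than `score("the study", ...)`: the first candidate splits into several sub-word pieces and overlaps strongly with the topic, while the second consists of whole-vocabulary words with little in common with each other or with the topic. In a realistic setting the trigram stand-in would be replaced by an actual sentence encoder and the splitter by the encoder's own tokenizer.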
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Text Analysis Techniques · Topic Modeling
