TL;DR
This paper introduces a customizable, unsupervised system that generates health-related misspellings using semantic and lexical filtering, improving text mining from noisy health data sources.
Contribution
It presents a novel, fully automatic misspelling generator leveraging dense vector models, with customizable filtering, outperforming existing methods in health-related text mining.
Findings
Outperforms state-of-the-art medication variant generation with F1-score of 0.69.
Increases Twitter post retrieval rate by over 67% with generated variants.
Offers a simple, customizable, and fully automatic misspelling generation system.
Abstract
In this paper, we present a customizable datacentric system that automatically generates common misspellings for complex health-related terms. The spelling variant generator relies on a dense vector model learned from large unlabeled text, which is used to find semantically close terms to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. Weighting of intra-word character sequence similarities allows further problem-specific customization of the system. On a dataset prepared for this study, our system outperforms the current state-of-the-art for medication name variant generation with best F1-score of 0.69 and F1/4-score of 0.78. Extrinsic evaluation of the system on a set of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
