An unsupervised and customizable misspelling generator for mining noisy health-related text sources

Abeed Sarker; Graciela Gonzalez-Hernandez

arXiv:1806.00910·cs.CL·June 21, 2023

TL;DR

This paper introduces a customizable, unsupervised system that generates health-related misspellings using semantic and lexical filtering, improving text mining from noisy health data sources.

Contribution

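The paper presents a fully automatic misspelling generator that uses a dense vector model to find semantically close terms and a customizable lexical-similarity filter to keep only likely spelling variants. It outperforms the prior state of the art for medication name variant generation.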

Findings

1. Outperforms the prior state of the art for medication name variant generation, with a best F1-score of 0.69.
2. Increases the Twitter post retrieval rate by over 67% when the generated variants are added to the search terms.
3. Offers a simple, customizable, and fully automatic misspelling generation system.

Abstract

In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms. The spelling variant generator relies on a dense vector model learned from large unlabeled text, which is used to find terms semantically close to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. Weighting of intra-word character sequence similarities allows further problem-specific customization of the system. On a dataset prepared for this study, our system outperforms the current state-of-the-art for medication name variant generation with a best F1-score of 0.69 and F1/4-score of 0.78. Extrinsic evaluation of the system on a set of…
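
The recursive procedure described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the authors' repository: the `TOY_NEIGHBORS` dictionary stands in for nearest-neighbor lookups in a dense vector model, `difflib.SequenceMatcher` stands in for whatever lexical similarity measure the paper uses, and the threshold of 0.7 is arbitrary. The intra-word weighting mentioned in the abstract is not modeled.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for semantic nearest neighbours from a dense
# vector model. A real system would query learned word embeddings.
TOY_NEIGHBORS = {
    "seroquel": ["seroquil", "quetiapine", "seroqual", "zyprexa"],
    "seroquil": ["seroquel", "serequil", "abilify"],
    "seroqual": ["seroquel", "seroquil"],
    "serequil": ["seroquil", "sleep"],
}


def lexical_similarity(a, b):
    """Similarity in [0, 1]; an assumed substitute for the paper's measure."""
    return SequenceMatcher(None, a, b).ratio()


def generate_variants(seed, neighbors, threshold=0.7):
    """Collect terms that are semantically close to an already accepted
    term and lexically close to the seed, until no new terms appear."""
    variants = set()
    frontier = [seed]
    while frontier:
        term = frontier.pop()
        for candidate in neighbors.get(term, []):
            if candidate == seed or candidate in variants:
                continue
            # Filter out semantic neighbours that are lexically too far
            # from the seed (e.g. generic names or other drugs).
            if lexical_similarity(seed, candidate) >= threshold:
                variants.add(candidate)
                frontier.append(candidate)  # expand this variant later
    return variants


if __name__ == "__main__":
    print(sorted(generate_variants("seroquel", TOY_NEIGHBORS)))
```

In this toy setup, "serequil" is reachable only through the accepted variant "seroquil", which is what the recursive expansion is meant to capture. Raising the threshold makes the filter stricter and would drop it.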

Code & Models

Repositories

https://bitbucket.org/asarker/qmisspell (official)
