A Technical Report: Entity Extraction using Both Character-based and Token-based Similarity
Zeyi Wen, Dong Deng, Rui Zhang, Kotagiri Ramamohanarao

TL;DR
This paper introduces a novel method for entity extraction that combines character-based and token-based similarity measures, effectively handling spelling errors and name variations with high accuracy and efficiency.
Contribution
It proposes a two-level similarity approach and techniques to reduce computational costs, improving entity extraction performance over existing methods.
Findings
Achieves high F1 scores between 0.91 and 0.97.
Significantly reduces the number of candidate substrings requiring detailed similarity computation.
Demonstrates efficiency and effectiveness on real-world datasets.
Abstract
Entity extraction is fundamental to many text mining tasks such as organisation name recognition. A popular approach to entity extraction is based on matching sub-string candidates in a document against a dictionary of entities. To handle spelling errors and name variations of entities, the matching is usually approximate, and edit distance or Jaccard distance is used to measure the dissimilarity between sub-string candidates and the entities. For approximate entity extraction from free text, existing work considers solely character-based or solely token-based similarity and hence cannot simultaneously deal with minor variations at the token level and typos. In this paper, we address this problem by considering both character-based similarity and token-based similarity (i.e. two-level similarity). Measuring one-level (e.g. character-based) similarity is computationally expensive, and measuring two-level…
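To make the distinction between the two similarity levels concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's actual measure or algorithm: `edit_distance` and `jaccard_distance` are the standard textbook definitions, while `two_level_distance` and its `max_token_edits` parameter are hypothetical names for one simple way of combining them (tokens count as matching when they are within a small edit distance of each other, and a Jaccard-style ratio is then taken over the matched tokens).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def jaccard_distance(x: set, y: set) -> float:
    """1 - |x & y| / |x | y|; two empty sets are treated as identical."""
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)


def two_level_distance(candidate: str, entity: str,
                       max_token_edits: int = 1) -> float:
    """Hypothetical two-level measure (an assumption, not the paper's).

    Tokens are greedily paired when their edit distance is at most
    max_token_edits; the result is a Jaccard-style distance over the
    paired tokens.
    """
    cand = candidate.lower().split()
    ent = entity.lower().split()
    unmatched = list(ent)
    matched = 0
    for tok in cand:
        for other in unmatched:
            if edit_distance(tok, other) <= max_token_edits:
                unmatched.remove(other)  # each entity token pairs once
                matched += 1
                break
    union = len(cand) + len(ent) - matched
    return 0.0 if union == 0 else 1.0 - matched / union


candidate = "Univrsity of Melbourne"        # typo in one token
entity = "The University of Melbourne"      # dictionary entry with an extra token

char_level = edit_distance(candidate.lower(), entity.lower())
token_level = jaccard_distance(set(candidate.lower().split()),
                               set(entity.lower().split()))
both_levels = two_level_distance(candidate, entity)
```

In this example the candidate differs from the dictionary entry by both a typo and a missing token. Exact token matching penalises the typo as a whole-token mismatch, and character-level edit distance penalises the missing token character by character, whereas the combined measure absorbs the typo and charges only for the missing token. The greedy pairing is chosen for brevity; it is not guaranteed to find the best token assignment.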
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Web Data Mining and Analysis
