An Operator for Entity Extraction in MapReduce
Ndapandula Nakashole

TL;DR
This paper introduces a cost-based operator for MapReduce that optimally chooses between index-based and filter-verify approaches for efficient dictionary-based entity extraction from large datasets.
Contribution
The paper presents a novel cost-based operator that automates the selection of the best entity extraction approach in MapReduce environments, considering large dictionaries and datasets.
Findings
Significant reduction in execution time when using the operator
Effective handling of large dictionaries and datasets
Improved accuracy in entity mention detection
Abstract
Dictionary-based entity extraction involves finding mentions of dictionary entities in text. Text mentions are often noisy, containing spurious or missing words. Efficient algorithms for detecting approximate entity mentions follow one of two general techniques. The first approach builds an index on the entities and performs index lookups of document substrings. The second approach recognizes that the number of substrings generated from documents can explode. To get around this, it uses a filter to prune the many substrings that cannot match any dictionary entity, and then verifies, by means of a text join, whether the remaining substrings are mentions of dictionary entities. The choice between the index-based approach and the filter-and-verification approach must be made case by case, as the best approach depends on the characteristics of the input…
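The two techniques described in the abstract can be illustrated with a small single-machine sketch. This is not the paper's operator or its MapReduce implementation. The similarity measure (Jaccard over token sets), the threshold, the window length, and all function names are assumptions chosen for illustration only.

```python
from collections import defaultdict


def tokens(text):
    """Lowercase whitespace tokenization (an assumed, simplistic tokenizer)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def substrings(doc_tokens, max_len):
    """Yield every contiguous token window of length 1..max_len."""
    n = len(doc_tokens)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            yield tuple(doc_tokens[i:j])


def extract_index_based(doc, entities, threshold=0.6, max_len=5):
    """Approach 1: index the entities, then look up each document substring."""
    # Inverted index: token -> entities containing that token.
    index = defaultdict(set)
    for entity in entities:
        for tok in set(tokens(entity)):
            index[tok].add(entity)

    matches = set()
    for sub in substrings(tokens(doc), max_len):
        # Index lookup: only entities sharing a token can have similarity > 0.
        candidates = set()
        for tok in sub:
            candidates |= index.get(tok, set())
        for entity in candidates:
            if jaccard(sub, tokens(entity)) >= threshold:
                matches.add((" ".join(sub), entity))
    return matches


def extract_filter_verify(doc, entities, threshold=0.6, max_len=5):
    """Approach 2: prune substrings with a cheap filter, then verify by a join."""
    vocab = {tok for entity in entities for tok in tokens(entity)}

    # Filter: Jaccard(sub, entity) can never exceed the fraction of the
    # substring's distinct tokens that occur in the dictionary vocabulary,
    # so substrings below the threshold on that fraction are safely pruned.
    survivors = [
        sub for sub in substrings(tokens(doc), max_len)
        if len(set(sub) & vocab) / len(set(sub)) >= threshold
    ]

    # Verify: a naive nested-loop join of surviving substrings with entities.
    return {
        (" ".join(sub), entity)
        for sub in survivors
        for entity in entities
        if jaccard(sub, tokens(entity)) >= threshold
    }
```

Under these assumptions both functions return the same set of (mention, entity) pairs and differ only in where the work is done: the first pays for building and probing an index, the second pays for a filtering pass plus a join over the survivors. That difference in cost is what makes the choice input-dependent.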
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Web Data Mining and Analysis
