An Operator for Entity Extraction in MapReduce

Ndapandula Nakashole

arXiv:1512.04973·cs.DB·December 17, 2015

TL;DR

This paper introduces a cost-based operator for MapReduce that optimally chooses between index-based and filter-verify approaches for efficient dictionary-based entity extraction from large datasets.

Contribution

The paper presents a novel cost-based operator that automates the selection of the best entity extraction approach in MapReduce environments, considering large dictionaries and datasets.

Findings

01. Significant reduction in execution time when using the operator

02. Effective handling of large dictionaries and datasets

03. Improved accuracy in entity mention detection

Abstract

Dictionary-based entity extraction involves finding mentions of dictionary entities in text. Text mentions are often noisy, containing spurious or missing words. Efficient algorithms for detecting approximate entity mentions follow one of two general techniques. The first approach builds an index on the entities and performs index lookups of document substrings. The second approach recognizes that the number of substrings generated from documents can explode. To get around this, it uses a filter to prune the many substrings that cannot match any dictionary entity, and then verifies, by means of a text join, whether the remaining substrings are mentions of dictionary entities. The choice between the index-based approach and the filter & verification-based approach must be made case by case, as the best approach depends on the characteristics of the input…
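The two techniques named in the abstract can be illustrated with a small single-machine sketch. This is not the paper's operator or its MapReduce implementation: the function names, the token-window enumeration, and the use of Jaccard similarity over token sets as the approximate-match criterion are all assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Set, Tuple


def windows(tokens: List[str], max_len: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for every token window of 1..max_len tokens."""
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            yield i, j


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two non-empty token sets (assumed match measure)."""
    return len(a & b) / len(a | b)


def index_based(doc: str, entities: List[str], threshold: float,
                max_len: int) -> Set[Tuple[str, str]]:
    """Approach 1: index the entities, then look up every document substring."""
    # Inverted index: token -> ids of entities containing that token.
    index: Dict[str, Set[int]] = {}
    for eid, ent in enumerate(entities):
        for tok in set(ent.split()):
            index.setdefault(tok, set()).add(eid)

    tokens = doc.split()
    found = set()
    for i, j in windows(tokens, max_len):
        window = frozenset(tokens[i:j])
        # Candidate entities share at least one token with the window.
        candidates = set().union(*(index.get(t, set()) for t in window))
        for eid in candidates:
            if jaccard(window, frozenset(entities[eid].split())) >= threshold:
                found.add((" ".join(tokens[i:j]), entities[eid]))
    return found


def filter_verify(doc: str, entities: List[str], threshold: float,
                  max_len: int) -> Set[Tuple[str, str]]:
    """Approach 2: prune substrings with a cheap filter, then verify by a join."""
    vocab = {tok for ent in entities for tok in ent.split()}
    entity_sets = [frozenset(ent.split()) for ent in entities]
    tokens = doc.split()

    # Filter: a window can only reach the threshold if at least that fraction
    # of its distinct tokens occurs somewhere in the dictionary vocabulary.
    survivors = []
    for i, j in windows(tokens, max_len):
        window = frozenset(tokens[i:j])
        if len(window & vocab) >= threshold * len(window):
            survivors.append((i, j, window))

    # Verify: join the surviving substrings against all entities.
    found = set()
    for i, j, window in survivors:
        for ent, ent_set in zip(entities, entity_sets):
            if jaccard(window, ent_set) >= threshold:
                found.add((" ".join(tokens[i:j]), ent))
    return found


if __name__ == "__main__":
    doc = "the new york city marathon starts in boston"
    dictionary = ["new york city", "boston marathon"]
    print(sorted(index_based(doc, dictionary, 0.6, 3)))
    print(sorted(filter_verify(doc, dictionary, 0.6, 3)))
```

Both functions return the same set of (substring, entity) pairs; they differ only in where the work is done. The index-based variant pays for one lookup per substring, while the filter-and-verify variant pays for a join over whatever survives the filter, which is why the better choice depends on the input.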

Taxonomy

Topics: Data Quality and Management · Advanced Database Systems and Queries · Web Data Mining and Analysis