Big Data and Cross-Document Coreference Resolution: Current State and Future Opportunities
Seyed-Mehdi-Reza Beheshti, Srikumar Venugopal, Seung Hwan Ryu, Boualem Benatallah, and Wei Wang

TL;DR
This paper reviews the current state and future challenges of Cross-Document Coreference Resolution (CDCR) in the context of big data, emphasizing scalability issues and the need for advanced techniques.
Contribution
It offers a comprehensive overview of CDCR concepts, assesses existing tools, and highlights big data challenges, guiding future research directions.
Findings
Existing CDCR tools face scalability issues with large datasets
Big data challenges hinder effective entity resolution across documents
Future work is needed to improve CDCR techniques for massive datasets
Abstract
Information Extraction (IE) is the task of automatically extracting structured information from unstructured or semi-structured machine-readable documents. Among the various IE tasks, extracting actionable intelligence from an ever-increasing amount of data depends critically upon Cross-Document Coreference Resolution (CDCR) - the task of identifying entity mentions across multiple documents that refer to the same underlying entity. Recently, document datasets on the order of tera- and petabytes have raised many challenges for performing effective CDCR, such as scaling to large numbers of mentions and limited representational power. The problem of analysing such datasets is called "big data". The aim of this paper is to provide readers with an understanding of the central concepts, subtasks, and the current state of the art in the CDCR process. We provide an assessment of existing tools/techniques for CDCR…
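To make the task definition concrete, here is a minimal sketch of CDCR as clustering mentions drawn from different documents. It is not the method of the paper or of any tool it assesses: the mention data, the `same_surname` name check, the Jaccard context similarity, and the 0.3 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations

# Toy mentions: (document id, mention string, context words).
# All names and contexts are invented for this example.
MENTIONS = [
    ("doc1", "John Smith", {"senator", "election", "ohio"}),
    ("doc2", "J. Smith", {"ohio", "senator", "campaign"}),
    ("doc3", "John Smith", {"guitar", "album", "tour"}),
    ("doc4", "Mr. Smith", {"tour", "album", "band"}),
]


def jaccard(a, b):
    """Jaccard overlap between two sets of context words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_surname(m1, m2):
    """Crude name compatibility (an assumption): last tokens match."""
    return m1.split()[-1].lower() == m2.split()[-1].lower()


def resolve(mentions, threshold=0.3):
    """Group mentions into entity clusters using union-find."""
    parent = list(range(len(mentions)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair of mentions is compared, so the cost grows
    # quadratically with the number of mentions.
    for i, j in combinations(range(len(mentions)), 2):
        _, name_i, ctx_i = mentions[i]
        _, name_j, ctx_j = mentions[j]
        if same_surname(name_i, name_j) and jaccard(ctx_i, ctx_j) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i, (doc_id, _, _) in enumerate(mentions):
        clusters.setdefault(find(i), []).append(doc_id)
    return sorted(clusters.values())


if __name__ == "__main__":
    print(resolve(MENTIONS))
```

On this toy input the two politician mentions fall into one cluster and the two musician mentions into another, even though "John Smith" appears in both. The all-pairs loop is the part that stops being practical as the number of mentions grows, which is the scaling concern the abstract raises.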
Taxonomy
Topics: Topic Modeling · Data Quality and Management · Natural Language Processing Techniques
