CEREC: A Corpus for Entity Resolution in Email Conversations

Parag Pravin Dakle; Dan I. Moldovan

arXiv:2105.10606 · cs.CL · June 3, 2021

TL;DR

This paper introduces CEREC, a large-scale annotated corpus for entity resolution in email conversations, built to support the development and evaluation of coreference resolution models on email data.

Contribution

The creation of the first extensive email conversation corpus with entity coreference annotations, facilitating research in email understanding and coreference resolution.

Findings

01 Best F1 of 59.2 on mention identification and coreference resolution

02 Evaluation of multiple features and four baseline models

03 Qualitative and quantitative analysis of baseline limitations and error patterns

Abstract

We present CEREC, the first large-scale corpus for entity resolution in email conversations. The corpus consists of 6,001 email threads from the Enron Email Corpus, containing 36,448 email messages and 60,383 entity coreference chains. Annotation is carried out as a two-step process with minimal manual effort. Experiments evaluate different features and the performance of four baselines on the corpus. For the task of mention identification and coreference resolution, the best reported performance is 59.2 F1, which leaves considerable room for improvement. An in-depth qualitative and quantitative error analysis is presented to understand the limitations of the baselines considered.
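To give a rough sense of the corpus scale, the counts quoted in the abstract can be turned into simple averages. The sketch below is only illustrative arithmetic: the function name `corpus_averages` and the dictionary keys are hypothetical, and the ratios are plain divisions of the stated totals, not statistics reported by the authors.

```python
# Corpus counts as stated in the abstract.
THREADS = 6001
MESSAGES = 36448
CHAINS = 60383


def corpus_averages(threads, messages, chains):
    """Return simple ratios derived from the raw corpus counts.

    Hypothetical helper for illustration; not part of the CEREC release.
    """
    return {
        "messages_per_thread": messages / threads,
        "chains_per_thread": chains / threads,
        # A chain can span several messages, so this is only a ratio of
        # totals, not a count of chains found inside each message.
        "chains_per_message": chains / messages,
    }


stats = corpus_averages(THREADS, MESSAGES, CHAINS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
# messages_per_thread: 6.07
# chains_per_thread: 10.06
# chains_per_message: 1.66
```

Under these assumptions, a typical thread holds about six messages and about ten coreference chains.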

Code & Models

Repositories

paragdakle/emailcoref
Official
