The Project Dialogism Novel Corpus: A Dataset for Quotation Attribution in Literary Texts

Krishnapriya Vishnubhotla; Adam Hammond; Graeme Hirst

arXiv:2204.05836 · cs.CL · April 13, 2022 · 6 cites

TL;DR

The paper introduces the Project Dialogism Novel Corpus, a large annotated dataset of quotations in English literary texts, enabling improved evaluation of quotation attribution and coreference models.

Contribution

PDNC is the largest corpus of its kind for literary quotations, with per-quotation annotations that support research on quotation attribution and coreference in literature.

Findings

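01 Largest corpus of its kind: 35,978 quotations across 22 full-length novels

02 Each quotation is annotated for speaker, addressees, quotation type, referring expression, and character mentions

03 Enables comprehensive evaluation of quotation attribution and coreference models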

Abstract

We present the Project Dialogism Novel Corpus, or PDNC, an annotated dataset of quotations for English literary texts. PDNC contains annotations for 35,978 quotations across 22 full-length novels, and is by an order of magnitude the largest corpus of its kind. Each quotation is annotated for the speaker, addressees, type of quotation, referring expression, and character mentions within the quotation text. The annotated attributes allow for a comprehensive evaluation of models of quotation attribution and coreference for literary texts.
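
To make the annotated attributes concrete, here is a minimal Python sketch of how one annotated quotation might be represented and how speaker attribution could be scored against it. The class name, field names, and the `speaker_accuracy` helper are all hypothetical, chosen for illustration only; they do not reflect the corpus's actual file format or the authors' evaluation code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Quotation:
    """Hypothetical record mirroring the attributes listed in the abstract."""
    text: str                                             # the quotation text
    speaker: str                                          # annotated speaker
    addressees: List[str] = field(default_factory=list)   # annotated addressees
    quote_type: str = ""                                  # type of quotation (label set not assumed here)
    referring_expression: str = ""                        # expression that introduces the speaker
    mentions: List[str] = field(default_factory=list)     # character mentions inside the quotation


def speaker_accuracy(gold: List[Quotation], predicted: List[str]) -> float:
    """Fraction of quotations whose predicted speaker matches the annotation."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(q.speaker == p for q, p in zip(gold, predicted))
    return correct / len(gold)


# Toy placeholder data, not drawn from the corpus.
gold = [
    Quotation(text="placeholder quote 1", speaker="Character A", addressees=["Character B"]),
    Quotation(text="placeholder quote 2", speaker="Character B", addressees=["Character A"]),
]
score = speaker_accuracy(gold, ["Character A", "Character C"])
```

Exact-match accuracy on the speaker field is only one possible measure; the addressee and mention fields would support analogous checks for addressee prediction and coreference.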

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification