Paragraph-based complex networks: application to document classification and authenticity verification

Henrique F. de Arruda; Vanessa Q. Marinho; Luciano da F. Costa; Diego R. Amancio

arXiv:1806.08467·cs.CL·February 25, 2019

TL;DR

This paper introduces a novel paragraph-based network model that captures semantic and syntactical features of texts, improving document classification and authenticity verification, including analysis of the Voynich manuscript.

Contribution

The study presents a new paragraph network representation that effectively captures semantic features and discriminates real from artificial texts, enhancing text classification methods.

Findings

01

Real texts form communities in the network.

02

The model captures semantic features unlike traditional co-occurrence networks.

03

The framework successfully analyzed the Voynich manuscript.

Abstract

With the increasing number of texts made available on the Internet, many applications have relied on text mining tools to tackle a diversity of problems. A relevant model to represent texts is the so-called word adjacency (co-occurrence) representation, which is known to capture mainly syntactical features of texts. In this study, we introduce a novel network representation that considers the semantic similarity between paragraphs. Two main properties of paragraph networks are considered: (i) their ability to incorporate characteristics that can discriminate real from artificial, shuffled manuscripts and (ii) their ability to capture syntactical and semantic textual features. Our results revealed that real texts are organized into communities, which turned out to be an important feature for discriminating them from artificial texts. Interestingly, we have also found that, differently…
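To make the idea of a paragraph network concrete, here is a minimal sketch in which each paragraph is a node and two paragraphs are linked when their similarity exceeds a threshold. The abstract does not specify the similarity measure or how edges are selected, so the choices here (raw word counts, cosine similarity, a fixed `threshold` of 0.2) and the function names are illustrative assumptions, not the authors' method.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(paragraph):
    """Lowercase a paragraph and count its word tokens (assumed representation)."""
    return Counter(re.findall(r"[a-z]+", paragraph.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def paragraph_network(paragraphs, threshold=0.2):
    """Return {(i, j): weight} for paragraph pairs more similar than `threshold`.

    The threshold value is a hypothetical choice made for this sketch.
    """
    vectors = [term_counts(p) for p in paragraphs]
    edges = {}
    for i, j in combinations(range(len(vectors)), 2):
        weight = cosine(vectors[i], vectors[j])
        if weight > threshold:
            edges[(i, j)] = weight
    return edges


paragraphs = [
    "The cat sat on the mat.",
    "A cat sat near the mat.",
    "Stock markets fell sharply today.",
]
edges = paragraph_network(paragraphs)
```

In this toy input, only the first two paragraphs share vocabulary, so they are the only pair joined by an edge. A community detection step could then be run on the resulting weighted graph; shuffling the words of a text would be expected to blur such groupings, which is the intuition behind using community structure to separate real from artificial texts.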
