Comparative Document Analysis for Large Text Corpora

Xiang Ren; Yuanhua Lv; Kuansan Wang; Jiawei Han

arXiv:1510.07197 · cs.IR · October 27, 2015

Open Access

TL;DR

This paper introduces a new problem, Comparative Document Analysis (CDA): automatically identifying the phrases that two documents (or two document sets) have in common and the phrases that distinguish each from the other, using a graph-based framework and joint optimization.

Contribution

It proposes a novel graph-based framework and iterative algorithm for joint discovery of commonalities and differences in document pairs, applicable across domains.

Findings

01. Effective in scientific and news corpora

02. Robust in comparing individual documents

03. Powerful in comparing document sets

Abstract

This paper presents a novel research problem on joint discovery of commonalities and differences between two individual documents (or document sets), called Comparative Document Analysis (CDA). Given any pair of documents from a document collection, CDA aims to automatically identify sets of quality phrases to summarize the commonalities of both documents and highlight the distinctions of each with respect to the other informatively and concisely. Our solution uses a general graph-based framework to derive novel measures on phrase semantic commonality and pairwise distinction, and guides the selection of sets of phrases by solving two joint optimization problems. We develop an iterative algorithm to integrate the maximization of phrase commonality or distinction measure with the learning of phrase-document semantic relevance in a mutually enhancing way. Experiments on text corpora from…
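
To make the input and output of the CDA task concrete, here is a minimal Python sketch of a naive frequency baseline. It is not the paper's method: it does not build a graph, learn phrase-document relevance, or solve the joint optimization problems. The function names, the stopword list, the n-gram candidate phrases, and both scoring rules are assumptions made purely for illustration.

```python
from collections import Counter
import re

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "of", "and", "in", "on", "for", "to", "with", "is", "are"}


def candidate_phrases(text, max_len=2):
    """Count stopword-free word n-grams (n <= max_len) as candidate phrases."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(tok in STOPWORDS for tok in gram):
                counts[" ".join(gram)] += 1
    return counts


def compare_documents(doc_a, doc_b, top_k=3):
    """Return top common phrases and top distinct phrases for each document.

    Naive scores (assumptions for this sketch):
      commonality(p)   = min(relative frequency in A, relative frequency in B)
      distinction_A(p) = relative frequency in A - relative frequency in B
    """
    a, b = candidate_phrases(doc_a), candidate_phrases(doc_b)
    total_a, total_b = sum(a.values()) or 1, sum(b.values()) or 1
    rel_a = {p: c / total_a for p, c in a.items()}
    rel_b = {p: c / total_b for p, c in b.items()}

    common = {p: min(rel_a[p], rel_b[p]) for p in rel_a.keys() & rel_b.keys()}
    distinct_a = {p: rel_a[p] - rel_b.get(p, 0.0) for p in rel_a}
    distinct_b = {p: rel_b[p] - rel_a.get(p, 0.0) for p in rel_b}

    def top(scores):
        # Highest score first; ties broken alphabetically; drop non-positive scores.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [p for p, s in ranked[:top_k] if s > 0]

    return {
        "common": top(common),
        "distinct_a": top(distinct_a),
        "distinct_b": top(distinct_b),
    }


if __name__ == "__main__":
    result = compare_documents(
        "graph ranking for phrase mining",
        "topic models for phrase mining",
    )
    for key, phrases in result.items():
        print(key, phrases)
```

A purely lexical baseline like this one only finds commonalities that are expressed with identical wording. The abstract's emphasis on semantic commonality and on learning phrase-document relevance suggests the proposed framework is meant to go beyond exact overlap.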

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques