Comparative Document Analysis for Large Text Corpora
Xiang Ren, Yuanhua Lv, Kuansan Wang, Jiawei Han

TL;DR
This paper introduces a new problem, Comparative Document Analysis (CDA): automatically identifying common and distinguishing phrases between two documents or document sets. The proposed solution uses a graph-based framework and joint optimization.
Contribution
It proposes a novel graph-based framework and iterative algorithm for joint discovery of commonalities and differences in document pairs, applicable across domains.
Findings
Effective in scientific and news corpora
Robust in comparing individual documents
Powerful in comparing document sets
Abstract
This paper presents a novel research problem on joint discovery of commonalities and differences between two individual documents (or document sets), called Comparative Document Analysis (CDA). Given any pair of documents from a document collection, CDA aims to automatically identify sets of quality phrases to summarize the commonalities of both documents and highlight the distinctions of each with respect to the other informatively and concisely. Our solution uses a general graph-based framework to derive novel measures on phrase semantic commonality and pairwise distinction, and guides the selection of sets of phrases by solving two joint optimization problems. We develop an iterative algorithm to integrate the maximization of phrase commonality or distinction measure with the learning of phrase-document semantic relevance in a mutually enhancing way. Experiments on text corpora from…
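To make the task concrete, here is a minimal toy sketch of the input and output shape of CDA. It is not the paper's graph-based method: it swaps in a simple occurrence-count relevance score, takes the minimum of the two relevance scores as a stand-in for commonality, and takes their difference as a stand-in for pairwise distinction. The function names `phrase_relevance` and `compare_documents`, the parameter `k`, and the scoring formulas are all assumptions made for illustration.

```python
def phrase_relevance(phrases, text):
    """Toy relevance (assumption): occurrence count of each phrase, scaled to [0, 1]."""
    lowered = text.lower()
    counts = {p: lowered.count(p.lower()) for p in phrases}
    top = max(counts.values(), default=0)
    return {p: (c / top if top else 0.0) for p, c in counts.items()}


def compare_documents(phrases, doc_a, doc_b, k=2):
    """Return (common, distinct_to_a, distinct_to_b) phrase lists, at most k each.

    Hypothetical scoring, not the paper's measures:
      commonality(p)      = min(rel_a(p), rel_b(p))
      distinction_a(p)    = rel_a(p) - rel_b(p)
      distinction_b(p)    = rel_b(p) - rel_a(p)
    """
    rel_a = phrase_relevance(phrases, doc_a)
    rel_b = phrase_relevance(phrases, doc_b)

    commonality = {p: min(rel_a[p], rel_b[p]) for p in phrases}
    distinct_a = {p: rel_a[p] - rel_b[p] for p in phrases}
    distinct_b = {p: rel_b[p] - rel_a[p] for p in phrases}

    def top_k(scores):
        # Rank by descending score, break ties alphabetically,
        # and keep only phrases with a strictly positive score.
        ranked = sorted(scores, key=lambda p: (-scores[p], p))
        return [p for p in ranked[:k] if scores[p] > 0]

    return top_k(commonality), top_k(distinct_a), top_k(distinct_b)
```

Calling `compare_documents(candidate_phrases, text_a, text_b)` returns three short phrase lists: one summarizing what the documents share and one per document highlighting what sets it apart. The paper's approach differs in that it learns phrase-document relevance jointly with phrase selection rather than fixing relevance up front as this sketch does.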
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
