Learning Document-Level Semantic Properties from Free-Text Annotations
S.R.K. Branavan, Harr Chen, Jacob Eisenstein, Regina Barzilay

TL;DR
This paper introduces a hierarchical Bayesian model that leverages noisy free-text annotations to infer and cluster semantic properties of documents, improving the summarization of reviews into salient keyphrases.
Contribution
The paper presents a novel joint inference approach that clusters paraphrased keyphrases and links them with latent topics to better understand document semantics from noisy annotations.
Findings
Outperforms alternative methods in summarizing documents with keyphrases.
Effectively clusters paraphrased keyphrases despite annotation noise.
Enhances the prediction of semantic properties in unannotated documents.
Abstract
This paper presents a new method for inferring the semantic properties of documents by leveraging free-text keyphrase annotations. Such annotations are becoming increasingly abundant due to the recent dramatic growth in semi-structured, user-generated online content. One especially relevant domain is product reviews, which are often annotated by their authors with pros/cons keyphrases such as "a real bargain" or "good value". These annotations are representative of the underlying semantic properties; however, unlike expert annotations, they are noisy: lay authors may use different labels to denote the same property, and some labels may be missing. To learn using such noisy annotations, we find a hidden paraphrase structure which clusters the keyphrases. The paraphrase structure is linked with a latent topic model of the review texts, enabling the system to predict the properties of…
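To make the idea of grouping paraphrased keyphrases concrete, here is a minimal sketch. It is not the paper's hierarchical Bayesian model: it swaps in a much simpler technique, single-link clustering on token overlap (Jaccard similarity). The function names, the example phrases other than the two quoted in the abstract, and the 0.3 threshold are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-level Jaccard similarity between two keyphrases."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cluster_keyphrases(phrases, threshold=0.3):
    """Single-link clustering: merge any two phrases whose token overlap
    meets the threshold (hypothetical value, chosen for this toy data)."""
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(parent, 2):
        if jaccard(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for p in parent:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


# Toy annotations; only "a real bargain" and "good value" come from the abstract.
phrases = [
    "good value", "great value",
    "a real bargain", "real bargain",
    "poor service", "bad service",
]
clusters = cluster_keyphrases(phrases)
for group in clusters:
    print(group)
```

On this toy input, "good value" and "a real bargain" end up in different clusters because they share no words, even though the abstract treats them as expressing the same property. Surface overlap alone cannot recover that kind of paraphrase, which is consistent with the abstract's choice to link the paraphrase structure to a latent topic model of the review text.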
Taxonomy
Topics: Advanced Text Analysis Techniques · Topic Modeling · Text and Document Classification Technologies
