World Knowledge as Indirect Supervision for Document Clustering
Chenguang Wang, Yangqiu Song, Dan Roth, Ming Zhang, Jiawei Han

TL;DR
This paper explores using general-purpose world knowledge as indirect supervision for document clustering, with the aim of reducing the cost of supervision and adapting that knowledge to specific domains.
Contribution
It introduces methods to adapt and represent world knowledge for domain-specific clustering and proposes a new clustering algorithm leveraging heterogeneous information networks.
Findings
The paper reports significant improvement over state-of-the-art clustering methods.
Freebase and YAGO2 are used effectively as sources of world knowledge.
Integrating world knowledge into the clustering process improves clustering performance.
Abstract
One of the key obstacles in making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider a framework that uses world knowledge as indirect supervision. World knowledge is general-purpose knowledge that is not designed for any specific domain. The key challenges are then how to adapt world knowledge to domains and how to represent it for learning. In this paper, we provide an example of using world knowledge for domain-dependent document clustering. We provide three ways to specify the world knowledge to domains by resolving the ambiguity of the entities and their types, and represent the data with world knowledge as a heterogeneous information network. Then we propose a clustering algorithm that can cluster multiple types and incorporate the sub-type information as constraints. In the…
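To make the idea of a heterogeneous information network more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that entity mentions in each document have already been linked to knowledge-base entities and types. The toy documents, the entity and type names, and the helper functions `build_hin` and `shared_entities` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: each document maps to (entity, type) pairs that are
# assumed to be already disambiguated against a knowledge base.
docs = {
    "d1": [("Barack Obama", "Person"), ("United States", "Country")],
    "d2": [("United States", "Country"), ("Google", "Organization")],
}


def build_hin(docs):
    """Build typed nodes and typed edges of a small heterogeneous network."""
    nodes = defaultdict(set)  # node type -> set of node names
    edges = defaultdict(set)  # (source type, target type) -> set of (source, target)
    for doc_id, mentions in docs.items():
        nodes["Document"].add(doc_id)
        for entity, entity_type in mentions:
            nodes[entity_type].add(entity)
            edges[("Document", entity_type)].add((doc_id, entity))
    return nodes, edges


def shared_entities(edges, doc_a, doc_b):
    """Return entities linked to both documents (a crude similarity signal)."""
    linked = defaultdict(set)
    for pairs in edges.values():
        for doc_id, entity in pairs:
            linked[doc_id].add(entity)
    return linked[doc_a] & linked[doc_b]


nodes, edges = build_hin(docs)
print(sorted(nodes))                        # node types present in the network
print(shared_entities(edges, "d1", "d2"))   # entities the two documents share
```

In this sketch, documents and typed entities are distinct node types, and each edge records which document mentions which entity. A clustering method that works over such a network can draw on the typed links in addition to the words of each document; the overlap function above is only a stand-in for that signal and does not reproduce the constrained clustering algorithm the abstract describes.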
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
