Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks
Xinyang Zhang, Chenwei Zhang, Xin Luna Dong, Jingbo Shang, Jiawei Han

TL;DR
This paper introduces a minimally-supervised text categorization framework that exploits structure-rich Web data by jointly training a text understanding module and a network learning module, achieving high accuracy from only a few labeled seed documents per category.
Contribution
The novel framework combines network-based analysis with deep textual modeling in a co-training setup, enabling effective categorization with very few seed documents.
Findings
Achieves 92% accuracy with only three seed documents per category.
Significantly outperforms existing methods in the minimally-supervised setting.
Comes close to the performance of a supervised BERT model trained on 50K labels.
Abstract
Text categorization is an essential task in Web content analysis. Considering the ever-evolving Web data and new emerging categories, instead of the laborious supervised setting, in this paper, we focus on the minimally-supervised setting that aims to categorize documents effectively, with a couple of seed documents annotated per category. We recognize that texts collected from the Web are often structure-rich, i.e., accompanied by various metadata. One can easily organize the corpus into a text-rich network, joining raw text documents with document attributes, high-quality phrases, label surface names as nodes, and their associations as edges. Such a network provides a holistic view of the corpus' heterogeneous data sources and enables a joint optimization for network-based analysis and deep textual model training. We therefore propose a novel framework for minimally supervised…
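The text-rich network described in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the function name `build_text_rich_network`, the input dictionary layout, and the rule that links a label surface name to any phrase containing it as a word are all assumptions made for this example.

```python
from collections import defaultdict


def build_text_rich_network(docs, label_names):
    """Build an undirected heterogeneous graph as an adjacency map.

    Nodes are (type, value) tuples with type in
    {"doc", "attr", "phrase", "label"}.
    `docs` is a list of dicts with keys "id", "attributes" (dict)
    and "phrases" (list of str). This layout is hypothetical.
    """
    edges = defaultdict(set)

    def link(u, v):
        edges[u].add(v)
        edges[v].add(u)

    # Join each document with its metadata attributes and its phrases.
    for doc in docs:
        doc_node = ("doc", doc["id"])
        for key, value in doc["attributes"].items():
            link(doc_node, ("attr", f"{key}={value}"))
        for phrase in doc["phrases"]:
            link(doc_node, ("phrase", phrase))

    # Assumed heuristic: connect a label surface name to every phrase
    # that contains it as a whole word.
    phrase_nodes = [node for node in edges if node[0] == "phrase"]
    for label in label_names:
        label_node = ("label", label)
        edges[label_node]  # make sure the label node exists even if isolated
        for phrase_node in phrase_nodes:
            if label.lower() in phrase_node[1].lower().split():
                link(label_node, phrase_node)

    return edges


docs = [
    {"id": "d1", "attributes": {"author": "alice"},
     "phrases": ["neural network", "sports news"]},
    {"id": "d2", "attributes": {"author": "alice"},
     "phrases": ["sports news"]},
]
network = build_text_rich_network(docs, ["sports"])
```

In this toy network the two documents are connected through the shared author attribute and the shared phrase, and the label node reaches both documents through that phrase. Paths of this kind are what a network learning module could exploit alongside the raw text.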
Taxonomy
Topics: Text and Document Classification Technologies · Topic Modeling · Natural Language Processing Techniques
Methods: Linear Layer · Linear Warmup With Linear Decay · Softmax · Adam · Multi-Head Attention · Residual Connection · Dropout · WordPiece · Attention Dropout · Weight Decay
