NatCat: Weakly Supervised Text Classification with Naturally Annotated Resources
Zewei Chu, Karl Stratos, Kevin Gimpel

TL;DR
NatCat is a large-scale, weakly supervised text classification resource built from online community data, enabling improved classifiers across multiple tasks without extensive manual labeling.
Contribution
This work introduces NatCat, a novel large-scale dataset from online sources for weakly supervised text classification, and demonstrates its effectiveness across diverse tasks.
Findings
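Classifiers trained on NatCat show large improvements over prior work on CatEval, a suite of 11 text classification tasks.
Different NatCat data sources (Wikipedia, Stack Exchange, Reddit) benefit particular tasks.
The paper benchmarks different modeling choices and combinations of the three resources.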
Abstract
We describe NatCat, a large-scale resource for text classification constructed from three data sources: Wikipedia, Stack Exchange, and Reddit. NatCat consists of document-category pairs derived from manual curation that occurs naturally within online communities. To demonstrate its usefulness, we build general purpose text classifiers by training on NatCat and evaluate them on a suite of 11 text classification tasks (CatEval), reporting large improvements compared to prior work. We benchmark different modeling choices and resource combinations and show how tasks benefit from particular NatCat data sources.
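As a rough illustration of the data described above, the sketch below models a document-category pair and selects a combination of sources, as one might when comparing resource combinations. The class name `DocCategoryPair`, the function `select_sources`, the source labels, and the toy examples are all hypothetical; they are not the authors' released format or code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class DocCategoryPair:
    """One weakly labeled example (hypothetical schema, not the released format)."""
    text: str      # document text
    category: str  # category name assigned by the online community
    source: str    # assumed labels: "wikipedia", "stackexchange", "reddit"


def select_sources(pairs: Iterable[DocCategoryPair], sources: Set[str]) -> List[DocCategoryPair]:
    """Keep only the pairs whose source is in the requested combination."""
    return [p for p in pairs if p.source in sources]


# Toy, invented examples for illustration only.
pairs = [
    DocCategoryPair("The piano is a keyboard instrument.", "Musical instruments", "wikipedia"),
    DocCategoryPair("How do I reverse a list in place?", "python", "stackexchange"),
    DocCategoryPair("Best budget hiking boots?", "hiking", "reddit"),
]

# Build a training subset from two of the three sources.
subset = select_sources(pairs, {"wikipedia", "reddit"})
```

Varying the set passed to `select_sources` gives one simple way to set up the kind of source-combination comparison the abstract mentions.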
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques
