NatCat: Weakly Supervised Text Classification with Naturally Annotated Resources

Zewei Chu; Karl Stratos; Kevin Gimpel

arXiv:2009.14335 · cs.CL · September 21, 2021

TL;DR

NatCat is a large-scale, weakly supervised text classification resource built from online community data, enabling improved classifiers across multiple tasks without extensive manual labeling.

Contribution

This work introduces NatCat, a novel large-scale dataset from online sources for weakly supervised text classification, and demonstrates its effectiveness across diverse tasks.

Findings

01

Classifiers trained on NatCat show large improvements over prior work on the 11 CatEval tasks.

02

Different data sources benefit specific tasks.

03

The paper benchmarks different modeling choices and combinations of the NatCat data sources.

Abstract

We describe NatCat, a large-scale resource for text classification constructed from three data sources: Wikipedia, Stack Exchange, and Reddit. NatCat consists of document-category pairs derived from manual curation that occurs naturally within online communities. To demonstrate its usefulness, we build general purpose text classifiers by training on NatCat and evaluate them on a suite of 11 text classification tasks (CatEval), reporting large improvements compared to prior work. We benchmark different modeling choices and resource combinations and show how tasks benefit from particular NatCat data sources.
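As a rough illustration of what "document-category pairs" could look like as training data, the sketch below stores pairs tagged with their source and expands them into binary (document, category, label) examples. This is only a sketch under stated assumptions: the `Pair` class, the `make_binary_examples` helper, the toy documents, and the negative-sampling scheme are all hypothetical and are not taken from the paper or its repository.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One document-category pair (hypothetical schema, not the paper's)."""
    source: str    # e.g. "wikipedia", "stackexchange", or "reddit"
    document: str
    category: str


def make_binary_examples(pairs, num_negatives=1, seed=0):
    """Expand pairs into (document, category, label) triples.

    Each pair yields one positive (label 1) plus up to `num_negatives`
    negatives (label 0) whose category is sampled from the other
    categories seen in the same source. The sampling scheme is an
    assumption made for this illustration.
    """
    rng = random.Random(seed)
    categories_by_source = {}
    for p in pairs:
        categories_by_source.setdefault(p.source, set()).add(p.category)

    examples = []
    for p in pairs:
        examples.append((p.document, p.category, 1))
        others = sorted(categories_by_source[p.source] - {p.category})
        for neg in rng.sample(others, min(num_negatives, len(others))):
            examples.append((p.document, neg, 0))
    return examples


# Invented toy data, for illustration only.
toy_pairs = [
    Pair("wikipedia", "An article about violin concertos.", "Music"),
    Pair("wikipedia", "An article about a football league.", "Sports"),
    Pair("reddit", "A post asking how to bake sourdough.", "cooking"),
]

if __name__ == "__main__":
    for example in make_binary_examples(toy_pairs):
        print(example)
```

Framing the data as binary pair compatibility would let a single classifier be applied to tasks with different label sets by scoring each candidate label name against a document, although whether the authors train their classifiers this way is not stated on this page.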

Code & Models

Repositories

ZeweiChu/NatCat
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques