Hierarchical Metadata-Aware Document Categorization under Weak Supervision

Yu Zhang; Xiusi Chen; Yu Meng; Jiawei Han

arXiv:2010.13556 · cs.CL · October 24, 2023 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces HiMeCat, a generative framework that integrates hierarchical labels, metadata, and text signals to improve document categorization under weak supervision, especially when annotated data is scarce.

Contribution

It proposes a novel joint embedding-based representation learning module and a hierarchical data augmentation method for weakly supervised hierarchical document classification.

Findings

01 HiMeCat outperforms competitive baselines.

02 Representation learning improves categorization accuracy.

03 Hierarchical data augmentation enhances training data diversity.

Abstract

Categorizing documents into a given label hierarchy is intuitively appealing due to the ubiquity of hierarchical topic structures in massive text corpora. Although related studies have achieved satisfying performance in fully supervised hierarchical document classification, they usually require massive human-annotated training data and only utilize text information. However, in many domains, (1) annotations are quite expensive where very few training samples can be acquired; (2) documents are accompanied by metadata information. Hence, this paper studies how to integrate the label hierarchy, metadata, and text signals for document categorization under weak supervision. We develop HiMeCat, an embedding-based generative framework for our task. Specifically, we propose a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata…
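To make the notion of a label hierarchy concrete, here is a minimal Python sketch. It is not HiMeCat's method or code; the toy category names, the `PARENT` map, and the helpers `label_path` and `is_consistent` are all hypothetical, chosen only to illustrate what it means for a document's labels to respect a tree of categories.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: child -> parent (None marks a top-level category).
PARENT: Dict[str, Optional[str]] = {
    "cs": None,
    "math": None,
    "cs.CL": "cs",
    "cs.CV": "cs",
    "math.ST": "math",
}


def label_path(leaf: str, parent: Dict[str, Optional[str]] = PARENT) -> List[str]:
    """Return the labels from the top-level category down to `leaf`."""
    path: List[str] = []
    node: Optional[str] = leaf
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def is_consistent(predicted: List[str],
                  parent: Dict[str, Optional[str]] = PARENT) -> bool:
    """True if every predicted label's parent is also among the predictions."""
    chosen = set(predicted)
    return all(parent[l] is None or parent[l] in chosen for l in predicted)


if __name__ == "__main__":
    print(label_path("cs.CL"))
    print(is_consistent(["cs", "cs.CL"]))
    print(is_consistent(["cs.CL"]))
```

Under this assumed representation, assigning a document to a leaf category implies assigning it to every ancestor, which is the kind of category dependency the abstract refers to.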

Code & Models

Repositories

yuzhimanhua/HIMECat (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques