Hierarchical Metadata-Aware Document Categorization under Weak Supervision
Yu Zhang, Xiusi Chen, Yu Meng, Jiawei Han

TL;DR
This paper introduces HiMeCat, a generative framework that integrates hierarchical labels, metadata, and text signals to improve document categorization under weak supervision, especially when annotated data is scarce.
Contribution
It proposes a novel embedding-based joint representation learning module and a hierarchical data augmentation method for weakly supervised hierarchical document classification.
Findings
HiMeCat outperforms competitive baselines.
Representation learning improves categorization accuracy.
Hierarchical data augmentation enhances training data diversity.
Abstract
Categorizing documents into a given label hierarchy is intuitively appealing due to the ubiquity of hierarchical topic structures in massive text corpora. Although related studies have achieved satisfying performance in fully supervised hierarchical document classification, they usually require massive human-annotated training data and only utilize text information. However, in many domains, (1) annotations are quite expensive where very few training samples can be acquired; (2) documents are accompanied by metadata information. Hence, this paper studies how to integrate the label hierarchy, metadata, and text signals for document categorization under weak supervision. We develop HiMeCat, an embedding-based generative framework for our task. Specifically, we propose a novel joint representation learning module that allows simultaneous modeling of category dependencies, metadata…
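To make the task setup concrete, the following is a minimal illustrative sketch of the three signals the abstract names (label hierarchy, metadata, text) and of a weakly supervised corpus in which only a few documents carry labels. It is not HiMeCat's implementation: the `PARENT` map, the `Document` class, the `label_path` helper, and all category names and venues are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical toy label hierarchy: child -> parent (None marks a top-level category).
PARENT: Dict[str, Optional[str]] = {
    "cs": None,
    "math": None,
    "cs.CL": "cs",
    "cs.LG": "cs",
    "math.ST": "math",
}


@dataclass
class Document:
    text: str
    # Metadata accompanying the document (illustrative fields only).
    metadata: Dict[str, str] = field(default_factory=dict)
    # Leaf category, or None when the document is unlabeled.
    leaf_label: Optional[str] = None


def label_path(leaf: str, parent: Dict[str, Optional[str]] = PARENT) -> List[str]:
    """Return the root-to-leaf chain of categories implied by a leaf label."""
    path: List[str] = []
    node: Optional[str] = leaf
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


# Weak supervision: only a few documents are labeled; the rest are not.
docs = [
    Document("neural text classification", {"venue": "ACL"}, "cs.CL"),
    Document("asymptotic estimators", {"venue": "AoS"}, "math.ST"),
    Document("an unlabeled paper", {"venue": "ACL"}),
]
labeled = [d for d in docs if d.leaf_label is not None]
paths = [label_path(d.leaf_label) for d in labeled]
```

In this sketch a leaf label implies every ancestor on its path, which is the property that separates hierarchical categorization from flat classification. The unlabeled document still carries text and metadata that a model could exploit.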
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Natural Language Processing Techniques
