Introducing Three New Benchmark Datasets for Hierarchical Text Classification
Jaco du Toit, Herman Redelinghuys, Marcel Dunaiski

TL;DR
This paper introduces three new benchmark datasets for hierarchical text classification in scientific publications, improving dataset quality and providing baselines for future research.
Contribution
The paper presents three novel HTC datasets from research publications, combining existing schemas to enhance dataset reliability and offering baseline evaluations of state-of-the-art methods.
Findings
Proposed a combined classification schema for dataset creation.
Demonstrated higher semantic similarity within classes in the new datasets.
Provided baseline classification results for future research.
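As a rough illustration of what "semantic similarity within classes" can mean, the sketch below computes the mean pairwise cosine similarity between documents that share a label. It is only a sketch under stated assumptions: the bag-of-words vectors stand in for whatever text representation the paper uses (this summary does not say), and the function names `bow`, `cosine`, and `mean_intra_class_similarity` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def bow(text):
    """Lowercased bag-of-words vector (an assumed stand-in for a real text embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mean_intra_class_similarity(docs):
    """docs is a list of (text, label) pairs.

    Returns {label: mean pairwise cosine similarity}.
    Labels with fewer than two documents are skipped.
    """
    by_label = {}
    for text, label in docs:
        by_label.setdefault(label, []).append(bow(text))
    result = {}
    for label, vectors in by_label.items():
        pairs = list(combinations(vectors, 2))
        if pairs:
            result[label] = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return result
```

A higher score for a label suggests that its documents use more similar vocabulary; comparing such scores across datasets is one simple way to reason about class coherence.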
Abstract
Hierarchical Text Classification (HTC) is a natural language processing task whose objective is to classify text documents into a set of classes from a structured class hierarchy. Many HTC approaches have been proposed that attempt to leverage the class hierarchy information in various ways to improve classification performance. Machine learning-based classification approaches require large amounts of training data and are most commonly compared on three established benchmark datasets: Web of Science (WOS), Reuters Corpus Volume 1 Version 2 (RCV1-V2), and New York Times (NYT). However, apart from the well-documented RCV1-V2 dataset, these datasets are not accompanied by detailed descriptions of their methodologies. In this paper, we introduce three new HTC benchmark datasets in the domain of research publications which comprise the titles and abstracts…
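To make the task definition concrete, here is a minimal sketch of how a class hierarchy and a document's label set might be represented. The class names, the `PARENT` mapping, and both helper functions are hypothetical illustrations, not taken from the paper's datasets or schema.

```python
# Hypothetical hierarchy: each class maps to its parent (None marks a top-level class).
PARENT = {
    "Computer Science": None,
    "Artificial Intelligence": "Computer Science",
    "Natural Language Processing": "Artificial Intelligence",
    "Medicine": None,
    "Oncology": "Medicine",
}


def path_to_root(label, parent=PARENT):
    """Return the classes from the top level down to `label`."""
    path = []
    while label is not None:
        path.append(label)
        label = parent[label]
    return path[::-1]


def is_consistent(labels, parent=PARENT):
    """A label set respects the hierarchy if every non-root label's parent is also present."""
    labels = set(labels)
    return all(parent[l] is None or parent[l] in labels for l in labels)
```

Under this representation, a document assigned "Natural Language Processing" would also carry "Artificial Intelligence" and "Computer Science", which is the kind of structure HTC methods can try to exploit.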
Taxonomy
Topics: Text and Document Classification Technologies
