Introducing Three New Benchmark Datasets for Hierarchical Text Classification

Jaco du Toit; Herman Redelinghuys; Marcel Dunaiski

arXiv:2411.19119·cs.IR·December 2, 2024

TL;DR

This paper introduces three new benchmark datasets for hierarchical text classification of scientific publications, aiming to improve dataset quality and providing baselines for future research.

Contribution

The paper presents three novel hierarchical text classification (HTC) datasets built from research publications. The datasets combine existing classification schemas to enhance reliability, and the paper reports baseline evaluations of state-of-the-art methods on them.

Findings

01. Proposed a combined classification schema for dataset creation.

02. Demonstrated higher semantic similarity within classes in the new datasets.

03. Provided baseline classification results for future research.
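
Finding 02 concerns semantic similarity within classes. One plausible way to quantify this, sketched here under the assumption that each document is already represented by an embedding vector, is the mean pairwise cosine similarity among the documents of a class. The function names and toy vectors below are hypothetical; this page does not describe the measurement procedure the paper actually uses.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_within_class_similarity(vectors):
    """Average cosine similarity over all unordered pairs of documents in one class."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two documents in the class")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (assumed, for illustration only).
tight_class = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]  # documents pointing the same way
loose_class = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # documents spread apart
```

A class whose documents are semantically close scores near 1, so comparing this value across datasets gives a rough indication of how coherent their classes are.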

Abstract

Hierarchical Text Classification (HTC) is a natural language processing task whose objective is to classify text documents into a set of classes from a structured class hierarchy. Many HTC approaches have been proposed that attempt to leverage the class hierarchy information in various ways to improve classification performance. Machine learning-based classification approaches require large amounts of training data and are most commonly compared on three established benchmark datasets: Web of Science (WOS), Reuters Corpus Volume 1 Version 2 (RCV1-V2), and New York Times (NYT). However, apart from the RCV1-V2 dataset, which is well documented, these datasets are not accompanied by detailed descriptions of their construction methodologies. In this paper, we introduce three new HTC benchmark datasets in the domain of research publications which comprise the titles and abstracts…
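
To make the task concrete, the following minimal Python sketch shows one common way to represent a class hierarchy and to expand a class into its full label path. The class names and the `PARENT` mapping are hypothetical illustrations and are not taken from the paper's datasets.

```python
# Hypothetical two-level hierarchy: child class -> parent class.
# None marks a top-level class.
PARENT = {
    "Computer Science": None,
    "Medicine": None,
    "Machine Learning": "Computer Science",
    "Databases": "Computer Science",
    "Oncology": "Medicine",
}


def label_path(label):
    """Return the path from the top-level class down to `label`."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path[::-1]


def is_consistent(labels):
    """Check that every assigned class comes with all of its ancestors."""
    assigned = set(labels)
    return all(set(label_path(label)) <= assigned for label in assigned)
```

For example, `label_path("Machine Learning")` walks up the hierarchy and returns the top-level class followed by the leaf, while `is_consistent` rejects a label set that names a child class without its parent.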

Code & Models

Datasets

marcelsun/wos_hierarchical_multi_label_text_classification (dataset, 75 downloads)

Taxonomy

Topics: Text and Document Classification Technologies

Methods: RoIAlign · Sparse Evolutionary Training · 1x1 Convolution · Convolution · Region Proposal Network · Feature Pyramid Network · Hybrid Task Cascade