Kaggle LSHTC4 Winning Solution

Antti Puurula; Jesse Read; Albert Bifet

arXiv:1405.0546 · cs.AI · May 12, 2014 · 21 cites

TL;DR

This paper describes the winning ensemble for the Kaggle LSHTC4 large-scale hierarchical text classification competition. It combines diverse sparse generative models with weighted voting tuned to maximize macroFscore.

Contribution

The paper introduces an ensemble method that predicts the documents for each label, rather than the labels for each document, using weighted voting over base-classifier outputs to optimize macroFscore.

Findings

01 Won the 2014 Kaggle LSHTC4 competition

02 Used diverse hierarchically smoothed Naive Bayes models with TF-IDF and BM25 feature pre-processing

03 Optimized macroFscore by predicting documents per label and weighting base-classifier votes
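To see why a label-wise view matters for this measure, the sketch below computes a macro-averaged F-score under one common definition: the unweighted mean of per-label F1. This definition, the function name `macro_f1`, and the toy data are assumptions for illustration; the exact variant scored in the competition is not stated here.

```python
from collections import defaultdict


def macro_f1(true_labels, pred_labels):
    """Unweighted mean of per-label F1 (assumed definition, see lead-in).

    true_labels, pred_labels: sequences of label sets, one set per document.
    """
    tp = defaultdict(int)  # true positives per label
    fp = defaultdict(int)  # false positives per label
    fn = defaultdict(int)  # false negatives per label
    for truth, pred in zip(true_labels, pred_labels):
        truth, pred = set(truth), set(pred)
        for label in truth & pred:
            tp[label] += 1
        for label in pred - truth:
            fp[label] += 1
        for label in truth - pred:
            fn[label] += 1

    labels = set(tp) | set(fp) | set(fn)
    if not labels:
        return 0.0

    total = 0.0
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        total += (2 * tp[label] / denom) if denom else 0.0
    return total / len(labels)


# Toy data: label "a" is frequent, label "b" is rare and never predicted.
truth = [{"a"}, {"a"}, {"a"}, {"b"}]
pred = [{"a"}, {"a"}, {"a"}, {"a"}]
score = macro_f1(truth, pred)
```

In the toy data three of four documents are labeled correctly, yet the score is only 3/7, because the rare label "b" contributes an F1 of zero with the same weight as the frequent label "a". Under this definition every label needs at least some correctly assigned documents, which is consistent with choosing documents per label.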

Abstract

Our winning submission to the 2014 Kaggle competition for Large Scale Hierarchical Text Classification (LSHTC) consists mostly of an ensemble of sparse generative models extending Multinomial Naive Bayes. The base-classifiers consist of hierarchically smoothed models combining document, label, and hierarchy level Multinomials, with feature pre-processing using variants of TF-IDF and BM25. Additional diversification is introduced by different types of folds and random search optimization for different measures. The ensemble algorithm optimizes macroFscore by predicting the documents for each label, instead of the usual prediction of labels per document. Scores for documents are predicted by weighted voting of base-classifier outputs with a variant of Feature-Weighted Linear Stacking. The number of documents per label is chosen using label priors and thresholding of vote scores. This…
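The label-wise prediction step described in the abstract can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it uses fixed per-classifier weights in place of Feature-Weighted Linear Stacking (where weights depend on meta-features), and the cap-from-prior rule, the relative `threshold`, and the name `predict_documents_per_label` are all assumptions. Scores are assumed to be non-negative.

```python
import numpy as np


def predict_documents_per_label(scores, weights, label_priors, threshold=0.5):
    """Pick documents for each label by weighted voting (illustrative sketch).

    scores:       (n_classifiers, n_labels, n_docs) base-classifier scores.
    weights:      (n_classifiers,) fixed vote weights (assumption).
    label_priors: (n_labels,) expected fraction of documents per label.
    threshold:    keep a document only if its vote is at least this
                  fraction of the label's best vote (assumption).
    Returns a dict mapping label index to a list of document indices.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Weighted vote: sum over classifiers, giving shape (n_labels, n_docs).
    votes = np.tensordot(weights, scores, axes=1)
    n_docs = votes.shape[1]

    assignments = {}
    for label, label_votes in enumerate(votes):
        # The label prior caps how many documents the label may receive,
        # but every label gets at least one document.
        cap = max(1, int(round(label_priors[label] * n_docs)))
        ranked = np.argsort(-label_votes, kind="stable")[:cap]
        best = label_votes[ranked[0]]
        # Always keep the top document; keep the rest only above threshold.
        keep = [int(ranked[0])]
        keep += [int(d) for d in ranked[1:] if label_votes[d] >= threshold * best]
        assignments[label] = keep
    return assignments


# Two base classifiers, two labels, four documents (made-up scores).
scores = [
    [[0.9, 0.1, 0.8, 0.0], [0.0, 0.2, 0.1, 0.7]],
    [[0.7, 0.3, 0.6, 0.2], [0.2, 0.0, 0.1, 0.9]],
]
result = predict_documents_per_label(scores, weights=[0.5, 0.5],
                                     label_priors=[0.5, 0.5])
```

With these made-up scores, label 0 receives documents 0 and 2, while label 1 receives only document 3 because its second-ranked document falls below the threshold. Looping over labels rather than documents guarantees each label at least one prediction, which is the property that matters for a macro-averaged measure.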

Taxonomy

Topics: Algorithms and Data Compression

Methods: Random Search