Kaggle LSHTC4 Winning Solution
Antti Puurula, Jesse Read, Albert Bifet

TL;DR
This paper presents the winning ensemble approach from the 2014 Kaggle LSHTC4 competition on large-scale hierarchical text classification. It combines diverse sparse generative models with an optimized voting scheme that targets macroFscore directly.
Contribution
It introduces an ensemble method that predicts the documents for each label, rather than the labels for each document, and scores candidate documents by weighted voting over base-classifier outputs.
Findings
Won the 2014 Kaggle LSHTC4 competition
Used diverse hierarchically smoothed extensions of Multinomial Naive Bayes, with TF-IDF and BM25 variants for feature pre-processing
Optimized macroFscore by predicting documents per label and weighting base-classifier votes in the ensemble
Abstract
Our winning submission to the 2014 Kaggle competition for Large Scale Hierarchical Text Classification (LSHTC) consists mostly of an ensemble of sparse generative models extending Multinomial Naive Bayes. The base-classifiers consist of hierarchically smoothed models combining document, label, and hierarchy level Multinomials, with feature pre-processing using variants of TF-IDF and BM25. Additional diversification is introduced by different types of folds and random search optimization for different measures. The ensemble algorithm optimizes macroFscore by predicting the documents for each label, instead of the usual prediction of labels per document. Scores for documents are predicted by weighted voting of base-classifier outputs with a variant of Feature-Weighted Linear Stacking. The number of documents per label is chosen using label priors and thresholding of vote scores. This…
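The label-wise prediction step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `select_documents_per_label`, the input layout, the constant per-classifier weights (the paper uses a variant of Feature-Weighted Linear Stacking, where weights depend on meta-features), and the way the label prior and the vote threshold are combined are all assumptions made for illustration.

```python
from collections import defaultdict


def select_documents_per_label(base_predictions, weights, label_priors,
                               n_docs, threshold=0.5):
    """Pick documents for each label by weighted voting (illustrative only).

    base_predictions: one dict per base classifier, doc_id -> set of labels.
    weights: one constant vote weight per base classifier (an assumption;
        this stands in for meta-feature-dependent stacking weights).
    label_priors: label -> prior probability of the label.
    n_docs: number of documents being classified.
    threshold: minimum normalized vote score needed to keep a document.
    """
    # Accumulate weighted votes for every (label, document) pair.
    votes = defaultdict(lambda: defaultdict(float))
    for preds, weight in zip(base_predictions, weights):
        for doc_id, labels in preds.items():
            for label in labels:
                votes[label][doc_id] += weight

    total_weight = sum(weights)
    selected = {}
    for label, doc_votes in votes.items():
        # Cap the number of documents using the label prior (at least one).
        max_docs = max(1, round(label_priors.get(label, 0.0) * n_docs))
        # Rank documents by vote score, breaking ties by document id.
        ranked = sorted(doc_votes.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[label] = [doc for doc, score in ranked[:max_docs]
                           if score / total_weight >= threshold]
    return selected


# Hypothetical toy input: three base classifiers, three documents.
example = select_documents_per_label(
    base_predictions=[
        {"d1": {"A"}, "d2": {"A", "B"}},
        {"d1": {"A"}, "d2": {"B"}},
        {"d3": {"A"}},
    ],
    weights=[0.5, 0.3, 0.2],
    label_priors={"A": 0.67, "B": 0.34},
    n_docs=3,
    threshold=0.4,
)
```

Looping over labels rather than documents lets rare labels receive at least one document each, which matters for a macro-averaged measure where every label counts equally.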
Taxonomy
Topics: Algorithms and Data Compression
Methods: Random Search
