DXML: Distributed Extreme Multilabel Classification

Pawan Kumar

arXiv:2112.10297 · cs.DC · December 21, 2021

Open Access

TL;DR

This paper introduces DXML, a scalable hybrid distributed- and shared-memory implementation of extreme multilabel classification. It combines MPI message passing across nodes with OpenMP multithreading on each node, and it trains and tests relatively faster on some large datasets.

Contribution

The paper presents a hybrid distributed- and shared-memory implementation of extreme multilabel classification, derives expressions for communication latency and communication volume, and analyzes shared-memory parallelism with the work-span model.

Findings

01

Relatively faster training and testing on some large datasets.

02

Relatively small model sizes in some cases.

03

The communication and work-span analysis indicates the expected scalability of similar extreme classification methods.

Abstract

As a big data application, extreme multilabel classification has emerged as an important research topic with applications in ranking and recommendation of products and items. A scalable hybrid distributed and shared memory implementation of extreme classification for large-scale ranking and recommendation is proposed. In particular, the implementation combines message passing across nodes using MPI with multithreading on each node using OpenMP. Expressions for communication latency and communication volume are derived. Parallelism for the shared memory architecture is derived using the work-span model. This sheds light on the expected scalability of similar extreme classification methods. Experiments show that the implementation is relatively faster to train and test on some large datasets. In some cases, the model size is relatively small.
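
The abstract refers to a work-span analysis and to communication latency and volume, but the paper's own expressions are not reproduced on this page. As a rough illustration only, the sketch below uses the generic textbook forms: the greedy-scheduler bound T_p <= T_1/p + T_inf for shared memory, and a latency-bandwidth (alpha-beta) cost for message passing. The names `WorkSpan`, `comm_time`, `alpha`, and `beta` are hypothetical, and the numbers are made up rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class WorkSpan:
    """Generic work-span summary of a parallel computation (not DXML-specific)."""
    work: float  # T_1: total operations executed on one thread
    span: float  # T_inf: length of the longest dependency chain

    def parallelism(self) -> float:
        # Maximum useful speedup: T_1 / T_inf
        return self.work / self.span

    def lower_bound(self, p: int) -> float:
        # No schedule on p threads can beat max(T_1 / p, T_inf)
        return max(self.work / p, self.span)

    def greedy_upper_bound(self, p: int) -> float:
        # A greedy scheduler achieves T_p <= T_1 / p + T_inf
        return self.work / p + self.span


def comm_time(messages: int, volume: float, alpha: float, beta: float) -> float:
    """Alpha-beta model: per-message latency plus per-unit transfer cost."""
    return messages * alpha + volume * beta


if __name__ == "__main__":
    # Illustrative numbers only; they are assumptions, not measurements.
    ws = WorkSpan(work=1_000_000.0, span=1_000.0)
    print("parallelism:", ws.parallelism())
    print("T_10 is between", ws.lower_bound(10), "and", ws.greedy_upper_bound(10))
    print("comm time:", comm_time(messages=4, volume=1000, alpha=2.0, beta=0.5))
```

Under this model, adding threads helps only while the thread count stays well below the parallelism T_1/T_inf, and adding nodes helps only while the per-node compute saved exceeds the added communication cost.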

Taxonomy

Topics: Text and Document Classification Technologies · Neural Networks and Applications