Hierarchical Optimal Transport for Document Representation

Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, and Justin Solomon

arXiv:1906.10827·cs.LG·November 5, 2019·35 cites

TL;DR

This paper introduces hierarchical optimal transport as a scalable and interpretable method for measuring document similarity by modeling documents as distributions over topics and solving an optimal transport problem at the topic level.

Contribution

The paper proposes a hierarchical optimal transport framework for measuring document similarity that is more scalable and more interpretable than prior document distances, which either ignore semantic similarity between words or scale poorly.

Findings

01 Better interpretability in document similarity

02 Enhanced scalability over previous methods

03 Comparable classification performance

Abstract

The ability to measure similarity between documents enables intelligent summarization and analysis of large corpora. Past distances between documents suffer from either an inability to incorporate semantic similarities between words or from scalability issues. As an alternative, we introduce hierarchical optimal transport as a meta-distance between documents, where documents are modeled as distributions over topics, which themselves are modeled as distributions over words. We then solve an optimal transport problem on the smaller topic space to compute a similarity score. We give conditions on the topics under which this construction defines a distance, and we relate it to the word mover's distance. We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost.
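The two-level construction described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation and not the API of the IBM/HOTT repository: the function names (`sinkhorn_cost`, `topic_cost_matrix`, `hott_distance`), the four-word vocabulary, the 2-D embeddings, and the topic and document weights are all hypothetical. For brevity it also swaps in an entropic (Sinkhorn) approximation of optimal transport, so the values it returns are close to, but not exactly, what an exact solver would give.

```python
import math

def sinkhorn_cost(a, b, cost, eps=0.05, iters=2000):
    """Approximate OT cost between histograms a and b (entropic Sinkhorn scaling)."""
    n, m = len(a), len(b)
    # Gibbs kernel of the ground cost; eps is the regularization strength.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is diag(u) K diag(v); return its total cost.
    return sum(u[i] * K[i][j] * v[j] * cost[i][j]
               for i in range(n) for j in range(m))

def topic_cost_matrix(topics, embeddings):
    """Lower level: cost between two topics = OT over words (a WMD-style distance)."""
    word_cost = [[math.dist(x, y) for y in embeddings] for x in embeddings]
    return [[sinkhorn_cost(s, t, word_cost) for t in topics] for s in topics]

def hott_distance(doc_p, doc_q, topic_cost):
    """Upper level: OT between two documents' topic proportions."""
    return sinkhorn_cost(doc_p, doc_q, topic_cost)

# Hypothetical toy data: vocabulary is [goal, match, stock, bond].
embeddings = [(0.0, 0.0), (0.2, 0.0), (1.0, 1.0), (1.2, 1.0)]
topics = [
    [0.5, 0.5, 0.0, 0.0],  # a "sports" topic
    [0.0, 0.0, 0.5, 0.5],  # a "finance" topic
]
topic_cost = topic_cost_matrix(topics, embeddings)

doc_a = [0.9, 0.1]  # mostly sports
doc_b = [0.8, 0.2]  # mostly sports
doc_c = [0.1, 0.9]  # mostly finance

d_ab = hott_distance(doc_a, doc_b, topic_cost)
d_ac = hott_distance(doc_a, doc_c, topic_cost)
print(d_ab < d_ac)
```

The topic-to-topic cost matrix depends only on the topics, so it can be computed once and reused for every document pair. Each pairwise comparison then solves a transport problem whose size is the number of topics rather than the vocabulary size, which is the source of the scalability the abstract describes.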

Code & Models

Repositories

IBM/HOTT

Taxonomy

Topics: Music and Audio Processing · Algorithms and Data Compression · Image Retrieval and Classification Techniques

Methods: Interpretability · k-Nearest Neighbors