From Image to Text Classification: A Novel Approach based on Clustering Word Embeddings

Andrei M. Butnaru; Radu Tudor Ionescu

arXiv:1707.08098·cs.CL·July 26, 2017

TL;DR

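This paper introduces a text classification method that clusters word embeddings with k-means and represents each document as a histogram over the resulting clusters, an idea borrowed from the bag of visual words model in computer vision. It also builds class-specific vocabularies rather than a single shared one, and reports better performance than the standard bag of words model.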

Contribution

The paper presents a new clustering-based approach for text classification that leverages class-specific vocabularies derived from word embeddings.

Findings

01. Outperforms the standard bag of words model in text categorization.

02. Effective in polarity classification tasks.

03. Utilizes class-specific vocabularies for better semantic representation.

Abstract

In this paper, we propose a novel approach for text classification based on clustering word embeddings, inspired by the bag of visual words model, which is widely used in computer vision. After each word in a collection of documents is represented as a word vector using a pre-trained word embedding model, a k-means algorithm is applied to the word vectors in order to obtain a fixed-size set of clusters. The centroid of each cluster is interpreted as a super word embedding that embodies all the semantically related word vectors in a certain region of the embedding space. Every embedded word in the collection of documents is then assigned to the nearest cluster centroid. In the end, each document is represented as a bag of super word embeddings by computing the frequency of each super word embedding in the respective document. We also diverge from the idea of building a single vocabulary…
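
The steps described in the abstract (embed words, cluster the vectors with k-means, assign each word to its nearest centroid, count centroid frequencies per document) can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-dimensional `toy_embeddings`, the example `docs`, the farthest-point initialization, and all function names are assumptions made for illustration, whereas the paper relies on a pre-trained word embedding model. The class-specific vocabularies are not covered here.

```python
# Minimal sketch of a "bag of super word embeddings" (illustrative only).
# Hypothetical 2-D toy vectors stand in for a pre-trained embedding model.
toy_embeddings = {
    "good": [1.0, 0.9],
    "great": [0.9, 1.0],
    "bad": [-1.0, -0.9],
    "awful": [-0.9, -1.0],
}

# Tokenized toy documents; "the" has no embedding and is skipped.
docs = [
    ["good", "great", "good"],
    ["bad", "awful"],
    ["good", "bad", "the"],
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vec, centroids[i]))


def init_centroids(vectors, k):
    """Deterministic farthest-point initialization (an assumption of this
    sketch, chosen so the toy example is reproducible)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm: assign vectors, then recompute the means."""
    centroids = init_centroids(vectors, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def bag_of_super_words(doc, embeddings, centroids):
    """Histogram counting how often each centroid is the nearest one."""
    hist = [0] * len(centroids)
    for word in doc:
        if word in embeddings:
            hist[nearest(embeddings[word], centroids)] += 1
    return hist


# Vocabulary in order of first appearance, restricted to embedded words.
vocab = list(dict.fromkeys(
    w for doc in docs for w in doc if w in toy_embeddings))
centroids = kmeans([toy_embeddings[w] for w in vocab], k=2)
histograms = [bag_of_super_words(d, toy_embeddings, centroids) for d in docs]

for doc, hist in zip(docs, histograms):
    print(doc, "->", hist)
```

In this toy setting, semantically close words such as "good" and "great" fall into the same cluster, so each histogram bin plays the role of one super word embedding. The resulting fixed-length histograms could then be passed to any standard classifier.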
