Unsupervised Learning for Lexicon-Based Classification

Jacob Eisenstein

arXiv:1611.06933·cs.LG·November 22, 2016

TL;DR

This paper provides a probabilistic foundation for lexicon-based classification, demonstrating how to learn word weights from unlabeled data to improve accuracy over traditional heuristics.

Contribution

It introduces a method to derive word weights from co-occurrence statistics without labeled data, enhancing lexicon-based classification performance.

Findings

01. Learned word weights improve classification accuracy.

02. Probabilistic justification for lexicon heuristics.

03. Outperforms traditional word-counting methods.

Abstract

In lexicon-based classification, documents are assigned labels by comparing the number of words that appear from two opposed lexicons, such as positive and negative sentiment. Creating such word lists is often easier than labeling instances, and they can be debugged by non-experts if classification performance is unsatisfactory. However, there is little analysis or justification of this classification heuristic. This paper describes a set of assumptions that can be used to derive a probabilistic justification for lexicon-based classification, as well as an analysis of its expected accuracy. One key assumption behind lexicon-based classification is that all words in each lexicon are equally predictive. This is rarely true in practice, which is why lexicon-based approaches are usually outperformed by supervised classifiers that learn distinct weights on each word from labeled instances.…
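
The word-counting heuristic described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code: the two toy lexicons, the regex tokenizer, the tie handling, and the names `tokenize` and `lexicon_classify` are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical toy lexicons, for illustration only (not taken from the paper).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_classify(text, positive=POSITIVE, negative=NEGATIVE):
    """Label a document by comparing counts of words from two opposed lexicons."""
    counts = Counter(tokenize(text))
    # Every lexicon word contributes exactly one unit per occurrence.
    pos = sum(counts[w] for w in positive)
    neg = sum(counts[w] for w in negative)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "tie"  # assumed behavior when the counts are equal


if __name__ == "__main__":
    print(lexicon_classify("A great film with an excellent cast, though the ending was bad."))
```

Because each lexicon word adds the same amount to its side's count, this sketch builds in the equal-predictiveness assumption that the abstract identifies as rarely true in practice.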

Code & Models

Repositories

jacobeisenstein/probabilistic-lexicon-classification
Official