Lexical Co-occurrence, Statistical Significance, and Word Association

Dipak Chaudhari, Om P. Damani, and Srivatsan Laxman

arXiv:1008.5287 · cs.CL · September 1, 2010 · 23 cites

Open Access

TL;DR

This paper introduces a theoretical framework for identifying statistically significant lexical co-occurrences in a text corpus. Rather than weighting by unigram frequencies, it considers only the documents that contain both terms of a candidate bigram and looks for biases in the span distributions of associated words.

Contribution

It proposes a framework that distinguishes classes of lexical co-occurrences by the strength of their document-level and corpus-level cues, and it evaluates existing co-occurrence measures on benchmark data sets.

Findings

01. Ochiai and CSA measures outperform others in capturing lexical co-occurrence

02. PMI performs poorly compared to alternative measures

03. The framework effectively differentiates co-occurrence classes
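For readers unfamiliar with two of the measures named in the findings, the following is a minimal sketch of the textbook definitions of PMI and the Ochiai coefficient, computed from document-level counts. It is an illustration only: the paper may define its counts differently (for example with span constraints), the CSA measure is not reproduced here, and the function names `doc_counts`, `pmi`, and `ochiai` as well as the toy corpus are hypothetical.

```python
import math


def doc_counts(docs, x, y):
    """Count documents containing x, containing y, and containing both.

    `docs` is a list of token sets (a toy stand-in for a corpus).
    """
    n_x = sum(1 for d in docs if x in d)
    n_y = sum(1 for d in docs if y in d)
    n_xy = sum(1 for d in docs if x in d and y in d)
    return n_x, n_y, n_xy


def pmi(n_xy, n_x, n_y, n_docs):
    """Textbook pointwise mutual information: log2(P(x,y) / (P(x) P(y)))."""
    if n_xy == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2((n_xy * n_docs) / (n_x * n_y))


def ochiai(n_xy, n_x, n_y):
    """Textbook Ochiai coefficient: n_xy / sqrt(n_x * n_y)."""
    if n_x == 0 or n_y == 0:
        return 0.0  # undefined when a word never occurs; treated as 0 here
    return n_xy / math.sqrt(n_x * n_y)


# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"new", "york", "city"},
    {"new", "car"},
    {"york", "city"},
    {"new", "york"},
]
n_x, n_y, n_xy = doc_counts(docs, "new", "york")
print("PMI:", pmi(n_xy, n_x, n_y, len(docs)))
print("Ochiai:", ochiai(n_xy, n_x, n_y))
```

Both functions take the same counts, so the two scores can be compared on any candidate word pair. PMI depends on the total number of documents, while this form of Ochiai depends only on the three occurrence counts.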

Abstract

Lexical co-occurrence is an important cue for detecting word associations. We present a theoretical framework for discovering statistically significant lexical co-occurrences from a given corpus. In contrast with the prevalent practice of giving weightage to unigram frequencies, we focus only on the documents containing both the terms (of a candidate bigram). We detect biases in span distributions of associated words, while being agnostic to variations in global unigram frequencies. Our framework has the fidelity to distinguish different classes of lexical co-occurrences, based on strengths of the document and corpus-level cues of co-occurrence in the data. We perform extensive experiments on benchmark data sets to study the performance of various co-occurrence measures that are currently known in the literature. We find that a relatively obscure measure called Ochiai, and a newly introduced…


Taxonomy

Topics: Authorship Attribution and Profiling · Topic Modeling · Natural Language Processing Techniques