Lexical Co-occurrence, Statistical Significance, and Word Association
Dipak Chaudhari, Om P. Damani, and Srivatsan Laxman

TL;DR
This paper introduces a theoretical framework for identifying statistically significant lexical co-occurrences in a text corpus. Rather than weighting by unigram frequencies, it considers only the documents that contain both terms of a candidate bigram and looks for biases in the span distributions of those terms.
Contribution
It proposes a framework that distinguishes classes of lexical co-occurrence by the strength of their document-level and corpus-level cues, and it evaluates existing co-occurrence measures on benchmark data sets.
Findings
Ochiai and CSA measures outperform others in capturing lexical co-occurrence
PMI performs poorly compared to alternative measures
The framework effectively differentiates co-occurrence classes
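To make the comparison between measures concrete, here is a minimal sketch of two of the measures named above, PMI and Ochiai, in their standard textbook forms. It assumes plain document-frequency counts over whitespace-tokenized text, which is a simplification: the paper works with span-constrained co-occurrences, and that part is not reproduced here. The function names and the toy corpus are hypothetical and serve only as illustration. CSA is not sketched because its definition does not appear in this summary.

```python
import math

def doc_counts(docs, x, y):
    """Return (n_x, n_y, n_xy, n_docs): how many documents contain x, y, and both.

    Assumption: whitespace tokenization and lowercase matching (illustrative only).
    """
    n_x = n_y = n_xy = 0
    for doc in docs:
        tokens = set(doc.lower().split())
        has_x, has_y = x in tokens, y in tokens
        n_x += has_x
        n_y += has_y
        n_xy += has_x and has_y
    return n_x, n_y, n_xy, len(docs)

def pmi(n_x, n_y, n_xy, n_docs):
    """Pointwise mutual information: log2( P(x, y) / (P(x) * P(y)) )."""
    if n_xy == 0:
        return float("-inf")
    return math.log2(n_xy * n_docs / (n_x * n_y))

def ochiai(n_x, n_y, n_xy):
    """Ochiai coefficient: n_xy / sqrt(n_x * n_y)."""
    if n_x == 0 or n_y == 0:
        return 0.0
    return n_xy / math.sqrt(n_x * n_y)

# Hypothetical toy corpus, not data from the paper.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog barked",
    "the stock market fell",
]
n_x, n_y, n_xy, n_docs = doc_counts(docs, "cat", "dog")
print(pmi(n_x, n_y, n_xy, n_docs))   # 0.0
print(ochiai(n_x, n_y, n_xy))        # 0.5
```

Note that Ochiai depends only on the counts for the two terms and their joint count, while PMI also depends on the total number of documents.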
Abstract
Lexical co-occurrence is an important cue for detecting word associations. We present a theoretical framework for discovering statistically significant lexical co-occurrences from a given corpus. In contrast with the prevalent practice of giving weightage to unigram frequencies, we focus only on the documents containing both the terms (of a candidate bigram). We detect biases in span distributions of associated words, while being agnostic to variations in global unigram frequencies. Our framework has the fidelity to distinguish different classes of lexical co-occurrences, based on strengths of the document and corpus-level cues of co-occurrence in the data. We perform extensive experiments on benchmark data sets to study the performance of various co-occurrence measures that are currently known in literature. We find that a relatively obscure measure called Ochiai, and a newly introduced…
Taxonomy
Topics: Authorship Attribution and Profiling · Topic Modeling · Natural Language Processing Techniques
