Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL

Peter D. Turney (National Research Council of Canada)

arXiv:cs/0212033·cs.LG·May 23, 2007·3 cites

TL;DR

This paper introduces PMI-IR, an unsupervised method that recognizes synonyms using statistics from Web search queries. It scores 74% on 80 TOEFL synonym questions, compared with 64% for LSA.

Contribution

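The paper presents PMI-IR, a simple unsupervised algorithm that measures the similarity of word pairs by combining Pointwise Mutual Information with Information Retrieval over Web search results, and shows that it scores higher than LSA on the same 80 TOEFL synonym questions.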

Findings

01

PMI-IR scores 74% on both the 80 TOEFL and the 50 ESL synonym questions.

02

LSA scores 64% on the same 80 TOEFL questions, 10 percentage points below PMI-IR.

03

The paper discusses potential applications of the algorithm and implications of the results for LSA and Latent Semantic Indexing (LSI).

Abstract

This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
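The abstract does not give the scoring formula, so the following is only a minimal sketch of how a PMI-based synonym test could be scored from hit counts. It assumes each choice is ranked by hits(problem AND choice) / hits(choice), which orders choices the same way as PMI because the problem word's own frequency is constant across choices. The toy document list, the `toy_hits` function, the `smoothing` constant, and the example question are all hypothetical stand-ins for a real search engine and test set.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a Web index: each string plays the role of one page.
TOY_DOCS = [
    "the levied tax was imposed on imports",
    "a tax was levied and imposed by the state",
    "they believed the story",
    "the fee was imposed last year",
    "she requested a refund",
    "taxes are levied and imposed annually",
]

HitCounter = Callable[..., int]


def toy_hits(*words: str) -> int:
    """Count toy documents containing all given words (an AND query)."""
    return sum(all(w in doc.split() for w in words) for doc in TOY_DOCS)


def pmi_ir_score(problem: str, choice: str, hits: HitCounter,
                 smoothing: float = 0.01) -> float:
    """Co-occurrence ratio hits(problem AND choice) / hits(choice).

    The smoothing term is an assumption added here to avoid dividing by
    zero when a choice never appears in the index.
    """
    joint = hits(problem, choice)
    alone = hits(choice)
    return joint / (alone + smoothing)


def best_choice(problem: str, choices: Sequence[str], hits: HitCounter) -> str:
    """Return the choice with the highest co-occurrence ratio."""
    return max(choices, key=lambda c: pmi_ir_score(problem, c, hits))


if __name__ == "__main__":
    choices = ["imposed", "believed", "requested", "correlated"]
    print(best_choice("levied", choices, toy_hits))
```

With a real search engine, `toy_hits` would be replaced by a function that returns the engine's reported hit count for a conjunctive query; the ranking logic would stay the same.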

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques