Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Peter D. Turney (National Research Council of Canada)

TL;DR
This paper introduces PMI-IR, an unsupervised method that recognizes synonyms using statistics gathered by querying a Web search engine. It scores 74% on 80 TOEFL synonym questions, compared with 64% for LSA.
Contribution
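The paper presents PMI-IR, a simple unsupervised algorithm that combines Pointwise Mutual Information (PMI) with Information Retrieval (IR) over Web search results to measure the similarity of word pairs. It scores higher than LSA on the same 80 TOEFL synonym questions.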
Findings
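PMI-IR scores 74% on 80 TOEFL synonym questions and 74% on 50 ESL synonym questions.
LSA scores 64% on the same 80 TOEFL questions.
The paper discusses potential applications of the algorithm and implications of the results for LSA and LSI (Latent Semantic Indexing).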
Abstract
This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
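The abstract does not spell out the scoring formula, so the following is only a minimal sketch of the idea, under stated assumptions. It assumes the simplest co-occurrence form of PMI: since the problem word is the same for every choice, ranking choices by PMI is equivalent to ranking them by hits(problem AND choice) / hits(choice). The `DOCS` list, the `hits` helper, and the sample question are hypothetical stand-ins for a Web search engine and a real test item; they are not taken from the paper.

```python
from typing import Sequence

# Hypothetical toy corpus standing in for the Web; invented for illustration.
DOCS = [
    "the levied tax was imposed on imports",
    "a tax was imposed by the council",
    "the council levied and imposed new fees",
    "he believed the story",
    "they requested a refund of the tax",
    "the judge imposed a fine",
]


def hits(*words: str) -> int:
    """Count documents containing all given words.

    Assumption: this plays the role of a search-engine hit count
    for a conjunctive (AND) query.
    """
    return sum(all(w in doc.split() for w in words) for doc in DOCS)


def pmi_ir_score(problem: str, choice: str) -> float:
    """Co-occurrence score proportional to the PMI ratio for a fixed problem word.

    Assumed form: hits(problem AND choice) / hits(choice).
    Returns 0.0 when the choice never occurs, to avoid dividing by zero.
    """
    choice_hits = hits(choice)
    if choice_hits == 0:
        return 0.0
    return hits(problem, choice) / choice_hits


def best_choice(problem: str, choices: Sequence[str]) -> str:
    """Pick the choice with the highest score as the proposed synonym."""
    return max(choices, key=lambda c: pmi_ir_score(problem, c))


if __name__ == "__main__":
    question = "levied"
    options = ["imposed", "believed", "requested", "correlated"]
    for option in options:
        print(option, pmi_ir_score(question, option))
    print("answer:", best_choice(question, options))
```

In a real setting the `hits` function would issue queries to a search engine rather than scan a list, and the hit counts would be far noisier than in this toy corpus.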
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Text Analysis Techniques
