Corpus specificity in LSA and Word2vec: the role of out-of-domain documents
Edgar Altszyler, Mariano Sigman, Diego Fernandez Slezak

TL;DR
This paper compares how corpus size and domain specificity affect the performance of LSA and Word2vec in capturing semantic relations, revealing that Word2vec benefits from larger corpora while LSA improves with domain-specific, reduced datasets.
Contribution
It demonstrates that corpus domain specificity enhances LSA performance and that Word2vec benefits from larger, comprehensive corpora, providing insights into their different mechanisms.
Findings
Word2vec performs best with the entire corpus.
LSA improves with domain-specific, reduced datasets.
Specialization can reduce LSA performance if dimensionality isn't decreased.
Abstract
Latent Semantic Analysis (LSA) and Word2vec are among the most widely used word-embedding techniques. Despite their popularity, the precise mechanisms by which they acquire new semantic relations between words remain unclear. In the present article we investigate whether the capacity of LSA and Word2vec to identify relevant semantic dimensions increases with corpus size. One intuitive hypothesis is that the capacity to identify relevant dimensions should increase as the amount of data increases. However, if the corpus grows in topics that are not specific to the domain of interest, the signal-to-noise ratio may weaken. Here we set out to examine and distinguish these alternative hypotheses. To investigate the effect of corpus specificity and size on word embeddings, we study two ways of progressively eliminating documents: the elimination of random documents vs. the elimination of…
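To make the document-elimination setup concrete, here is a minimal Python sketch of two ways to shrink a corpus. It rests on assumptions rather than the authors' code: the abstract as quoted here is cut off before naming the second strategy, so the sketch assumes, from the paper's title, that it targets out-of-domain documents. The helper names (`eliminate_random`, `domain_score`, `eliminate_out_of_domain`), the keyword-based relevance score, and the toy documents are all hypothetical and chosen only for illustration.

```python
import random


def eliminate_random(docs, keep_fraction, seed=0):
    """Keep a random subset of the documents, preserving their original order."""
    n_keep = max(1, round(len(docs) * keep_fraction))
    rng = random.Random(seed)  # seeded so the subset is reproducible
    kept = set(rng.sample(range(len(docs)), n_keep))
    return [d for i, d in enumerate(docs) if i in kept]


def domain_score(doc, domain_terms):
    """Hypothetical relevance score: fraction of tokens that are domain terms."""
    tokens = doc.lower().split()
    if not tokens:
        return 0.0
    return sum(t in domain_terms for t in tokens) / len(tokens)


def eliminate_out_of_domain(docs, domain_terms, keep_fraction):
    """Keep the documents with the highest domain score (assumed strategy)."""
    n_keep = max(1, round(len(docs) * keep_fraction))
    ranked = sorted(
        range(len(docs)),
        key=lambda i: domain_score(docs[i], domain_terms),
        reverse=True,
    )
    kept = set(ranked[:n_keep])
    return [d for i, d in enumerate(docs) if i in kept]


# Toy corpus, invented for this example only.
docs = [
    "the dream was a nightmare about falling",
    "stock markets fell sharply on monday",
    "she had a lucid dream last night",
    "the recipe needs two cups of flour",
]
domain_terms = {"dream", "nightmare", "lucid"}

random_subset = eliminate_random(docs, keep_fraction=0.5, seed=0)
domain_subset = eliminate_out_of_domain(docs, domain_terms, keep_fraction=0.5)
```

Both subsets have the same size, so any difference in embedding quality between them could be attributed to which documents were removed rather than to how many. Training LSA or Word2vec on each subset and scoring the result on a semantic task is left out of the sketch.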
