Word2Vec is a special case of Kernel Correspondence Analysis and Kernels for Natural Language Processing
Hirotaka Niitsuma, Minho Lee

TL;DR
This paper demonstrates that Word2Vec is a special case of Kernel Correspondence Analysis (KCA), introduces a memory-efficient semi-supervised KCA for NLP, and proposes a tail-cut kernel that improves word-vector representations.
Contribution
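The paper establishes an equivalence between correspondence analysis (CA) and a Gini index defined on appropriately scaled one-hot encodings, extends CA with nonlinear kernels for NLP, and introduces a memory-efficient CA method and a new tail-cut kernel for better word embeddings.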
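For orientation, textbook CA on a small contingency table can be sketched as below. This is the standard SVD-of-standardized-residuals formulation, not code from the paper; the function name `correspondence_analysis` and the toy `counts` table are illustrative assumptions, and the table is assumed to have no all-zero rows or columns.

```python
import numpy as np

def correspondence_analysis(table, n_components=2):
    """Classical CA via SVD of standardized residuals (hypothetical helper)."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                     # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)                   # row masses
    c = P.sum(axis=0)                   # column masses
    expected = np.outer(r, c)           # independence model r c^T
    S = (P - expected) / np.sqrt(expected)   # D_r^{-1/2} (P - r c^T) D_c^{-1/2}
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    # Principal coordinates: scale singular vectors by singular values and masses.
    row_coords = (U[:, :k] * s[:k]) / np.sqrt(r)[:, None]
    col_coords = (Vt[:k].T * s[:k]) / np.sqrt(c)[:, None]
    return row_coords, col_coords, s[:k]

# Toy 3x3 contingency table (made-up counts, for illustration only).
counts = np.array([[10, 2, 1],
                   [3, 8, 2],
                   [1, 3, 9]])
rows, cols, sv = correspondence_analysis(counts)
```

The squared singular values sum to the total inertia, which equals the chi-squared statistic of the table divided by the total count.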
Findings
The tail-cut kernel outperforms existing word-vector representation methods
Memory-efficient CA enables NLP applications with numerous categories
The kernel extension to CA provides a new analysis framework for language data
Abstract
We show that correspondence analysis (CA) is equivalent to defining a Gini index with appropriately scaled one-hot encoding. Using this relation, we introduce a nonlinear kernel extension to CA. This extended CA gives a known analysis for natural language via specialized kernels that use an appropriate contingency table. We propose a semi-supervised CA, which is a special case of the kernel extension to CA. Because CA requires excessive memory if applied to numerous categories, CA has not been used for natural language processing. We address this problem by introducing delayed evaluation to randomized singular value decomposition. The memory-efficient CA is then applied to a word-vector representation task. We propose a tail-cut kernel, which is an extension to the skip-gram within the kernel extension to CA. Our tail-cut kernel outperforms existing word-vector representation methods.
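The memory problem and the delayed-evaluation idea can be illustrated with a minimal sketch. The standardized residual matrix used by CA is dense even when the count table is sparse, so the sketch never forms it; it only evaluates products with it inside a generic randomized SVD. This is an assumption-laden illustration, not the authors' algorithm: the name `lazy_ca_svd`, its parameters, and the random `demo` table are hypothetical, and the table is assumed to have no all-zero rows or columns.

```python
import numpy as np
import scipy.sparse as sp

def lazy_ca_svd(counts, k=2, oversample=10, n_iter=4, seed=0):
    """Randomized SVD of the CA residual matrix S without materializing S."""
    N = sp.csr_matrix(counts, dtype=float)
    P = N / N.sum()                              # sparse correspondence matrix
    r = np.asarray(P.sum(axis=1)).ravel()        # row masses
    c = np.asarray(P.sum(axis=0)).ravel()        # column masses
    sr, sc = np.sqrt(r), np.sqrt(c)

    # S = D_r^{-1/2} P D_c^{-1/2} - sqrt(r) sqrt(c)^T; apply each term lazily.
    def matmat(X):                               # computes S @ X
        return (P @ (X / sc[:, None])) / sr[:, None] - np.outer(sr, sc @ X)

    def rmatmat(Y):                              # computes S.T @ Y
        return (P.T @ (Y / sr[:, None])) / sc[:, None] - np.outer(sc, sr @ Y)

    rng = np.random.default_rng(seed)
    ell = min(k + oversample, min(P.shape))      # sketch width
    Q, _ = np.linalg.qr(matmat(rng.standard_normal((P.shape[1], ell))))
    for _ in range(n_iter):                      # power iterations with re-orthonormalization
        Q, _ = np.linalg.qr(rmatmat(Q))
        Q, _ = np.linalg.qr(matmat(Q))
    B = rmatmat(Q).T                             # B = Q.T @ S, a small (ell x n_cols) matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Made-up count table, for illustration only.
demo = np.random.default_rng(1).integers(1, 20, size=(8, 6))
U, s, Vt = lazy_ca_svd(demo, k=2)
```

Only the sparse table, the two mass vectors, and thin dense matrices of width `ell` are held in memory, which is what makes this style of evaluation attractive when the number of categories (for example, vocabulary size) is large.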
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
