The Corpus Replication Task

Tobias Eichinger

arXiv:1806.07978 · cs.LG · June 22, 2018

TL;DR

This paper investigates how word2vec captures semantic and relational similarity. It proposes the Corpus Replication Task: generating input text for which word2vec outputs specified target relations in the resulting word embeddings.

Contribution

The paper introduces the Corpus Replication Task as a bottom-up way to analyze which relations word embeddings can represent and how those relations are built.
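To make the direction of the task concrete, here is a minimal sketch of generating a candidate input corpus from a set of target relations. It is an illustration only, not the paper's construction: the function `build_candidate_corpus`, the `attributes` mapping, and the idea of encoding a relation through shared context tokens are all assumptions introduced here.

```python
import random


def build_candidate_corpus(attributes, n_sentences=200, seed=0):
    """Generate tokenized sentences in which each target word co-occurs
    with its assigned context tokens (hypothetical helper, not from the paper).

    attributes: dict mapping a target word to the list of context tokens
    it should appear with.
    """
    rng = random.Random(seed)
    words = sorted(attributes)
    corpus = []
    for _ in range(n_sentences):
        word = rng.choice(words)
        context = list(attributes[word])
        rng.shuffle(context)  # vary token order within the context window
        corpus.append([word] + context)
    return corpus


# Hypothetical target relation: man : king :: woman : queen.
# "he"/"she" stand in for the gender attribute, "crown" for royalty.
attributes = {
    "man": ["he"],
    "woman": ["she"],
    "king": ["he", "crown"],
    "queen": ["she", "crown"],
}
corpus = build_candidate_corpus(attributes)
```

The resulting list of token lists could then be passed to a word2vec implementation, and the trained vectors inspected to see whether the target relation actually emerges. Whether it does is exactly the open question the task poses.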

Findings

01

Word2vec embeddings have been shown to capture relational similarity in addition to semantic similarity.

02

The Corpus Replication Task enables targeted analysis of relation building.

03

The author deems the approach generalizable to any set of relations.

Abstract

We revisit word2vec, the well-known word embedding algorithm from Natural Language Processing (NLP). Word embeddings represent words as vectors that capture the words' distributional similarity. Unexpectedly, word embeddings generated by word2vec have been shown to capture relational similarity in addition to semantic similarity, which raises two questions: first, which kinds of relations are representable in continuous space, and second, how are these relations built? To tackle these questions, we propose a bottom-up point of view: we call the task of generating input text for which word2vec outputs given target relations the Corpus Replication Task. Since we deem generalizations of this approach to any set of relations possible, we expect that solving the Corpus Replication Task will provide partial answers to both questions.
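Relational similarity in word embeddings is commonly tested through vector offsets: a relation a : b :: c : d counts as captured if the vector b - a + c lies closest to d. The sketch below shows that check on hand-written toy vectors. The vectors, the vocabulary, and the helper names `cosine` and `solve_analogy` are assumptions made for illustration; they are not taken from the paper and are not the output of a trained word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solve_analogy(emb, a, b, c):
    """Return the word whose vector is most similar to emb[b] - emb[a] + emb[c],
    excluding the three query words themselves."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))


# Toy embedding (hypothetical values): axes roughly encode
# "male", "female", and "royal".
toy = {
    "man": [1.0, 0.0, 0.1],
    "woman": [0.0, 1.0, 0.1],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.5, 0.5, -1.0],
}

answer = solve_analogy(toy, "man", "king", "woman")
```

In a Corpus Replication setting, a check of this kind would serve as the acceptance test: a generated corpus solves the task for a given relation only if the embeddings trained on it pass.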
