SubGram: Extending Skip-gram Word Representation with Substrings

Tom Kocmi; Ondřej Bojar

arXiv:1806.06571·cs.CL·July 9, 2020

TL;DR

SubGram enhances the Skip-gram word embedding model by incorporating word substrings, leading to improved performance in capturing linguistic information without additional supervision.

Contribution

The paper extends Skip-gram so that training also takes word structure (substrings) into account; the authors report large gains on the original Skip-gram test set.

Findings

01 Achieves large gains over the original Skip-gram model on the original Skip-gram test set

02 Captures syntactic and semantic information about words without explicit supervision

03 Demonstrates the benefit of modeling word structure during training

Abstract

Skip-gram (word2vec) is a recent method for creating vector representations of words ("distributed word representations") using a neural network. The representation gained popularity in various areas of natural language processing, because it seems to capture syntactic and semantic information about words without any explicit supervision in this respect. We propose SubGram, a refinement of the Skip-gram model to consider also the word structure during the training process, achieving large gains on the Skip-gram original test set.
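
The abstract does not spell out how word structure enters the model. As a rough illustration only, one common way to expose a word's structure is to break it into character substrings, so that related word forms share units. The sketch below assumes boundary markers `^` and `$` and a substring length range of 2 to 4; the function name `word_substrings` and all of its parameters are hypothetical choices for this example and are not taken from the paper or its repository.

```python
def word_substrings(word, min_len=2, max_len=4, bow="^", eow="$"):
    """Return the set of character substrings of a boundary-marked word.

    Assumption: the markers and the length range are illustrative defaults,
    not values confirmed by the paper.
    """
    marked = f"{bow}{word}{eow}"
    subs = set()
    for n in range(min_len, max_len + 1):
        # Slide a window of width n over the marked word.
        for i in range(len(marked) - n + 1):
            subs.add(marked[i:i + n])
    # Also keep the whole marked word as a unit of its own.
    subs.add(marked)
    return subs


# Related word forms end up sharing substrings such as the suffix "ked$".
shared = word_substrings("walked") & word_substrings("talked")
print(sorted(shared))
```

In a substring-aware model, units like these could stand in for, or sit alongside, the single word identity that plain Skip-gram uses as input; how SubGram actually does this is described in the paper itself.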

Code & Models

Repositories

tomkocmi/SubGram (Official)
