PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding

Zhao Jinman; Shawn Zhong; Xiaomin Zhang; Yingyu Liang

arXiv:2010.10813·cs.CL·October 22, 2020

TL;DR

This paper introduces PBoS, a probabilistic model that generates embeddings for out-of-vocabulary words from their spelling alone. It improves on previous subword-level models in word similarity and POS tagging without relying on explicit morphological knowledge.

Contribution

The paper presents PBoS, a probabilistic model with an efficient algorithm that simultaneously models subword segmentation and composes word embeddings from subwords, using only the spelling of a word. It outperforms previous subword-level models.

Findings

01. PBoS produces meaningful subword segmentations and subword rankings without explicit morphological knowledge.

02. PBoS outperforms previous subword-level models on word similarity.

03. PBoS outperforms previous subword-level models on POS tagging.

Abstract

We look into the task of generalizing word embeddings: given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words, without extra contextual information. We rely solely on the spellings of words and propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embeddings. We call the model probabilistic bag-of-subwords (PBoS), as it applies bag-of-subwords to all possible segmentations, weighted by their likelihood. Inspections and an affix prediction experiment show that PBoS is able to produce meaningful subword segmentations and subword rankings without any source of explicit morphological knowledge. Word similarity and POS tagging experiments show clear advantages of PBoS over previous subword-level models in the…
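
The idea of applying bag-of-subwords to all possible segmentations based on their likelihood can be illustrated with a small sketch. This is not the authors' implementation: it enumerates segmentations by brute force rather than using the efficient algorithm the abstract mentions, and it assumes that a segmentation's likelihood is the product of independent subword probabilities. The function names, the `subword_prob` and `subword_vec` tables, and all numeric values are hypothetical.

```python
from itertools import combinations
from math import prod


def segmentations(word):
    """Yield every way to split `word` into contiguous, non-empty pieces."""
    n = len(word)
    for k in range(n):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]


def pbos_style_embedding(word, subword_prob, subword_vec, dim):
    """Likelihood-weighted sum of bag-of-subwords vectors over all segmentations.

    Assumption: a segmentation's score is the product of its subword
    probabilities, and segmentations with an unknown subword are skipped.
    """
    scored = []
    for seg in segmentations(word):
        if all(s in subword_prob for s in seg):
            scored.append((seg, prod(subword_prob[s] for s in seg)))
    total = sum(score for _, score in scored)
    out = [0.0] * dim
    if total == 0.0:
        return out  # no usable segmentation: fall back to the zero vector
    for seg, score in scored:
        weight = score / total  # normalized likelihood of this segmentation
        for s in seg:
            for i, x in enumerate(subword_vec[s]):
                out[i] += weight * x
    return out


# Toy tables (hypothetical values, chosen so the arithmetic is exact).
toy_prob = {"un": 0.5, "do": 0.25, "undo": 0.125}
toy_vec = {"un": [1.0, 0.0], "do": [0.0, 1.0], "undo": [2.0, 0.0]}

# Two segmentations survive: ["undo"] and ["un", "do"], each with weight 0.5.
print(pbos_style_embedding("undo", toy_prob, toy_vec, dim=2))  # [1.5, 0.5]
```

Brute-force enumeration visits 2^(n-1) segmentations for a word of length n, so it is only practical for short words. It is used here to keep the weighting explicit.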

Code & Models

Repositories

jmzhao/pbos (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies