PBoS: Probabilistic Bag-of-Subwords for Generalizing Word Embedding
Zhao Jinman, Shawn Zhong, Xiaomin Zhang, Yingyu Liang

TL;DR
This paper introduces PBoS, a probabilistic model that predicts embeddings for out-of-vocabulary words from their spelling alone. It improves on previous subword-level models in word similarity and POS tagging without relying on explicit morphological knowledge.
Contribution
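The paper presents PBoS, a probabilistic model with an efficient algorithm that simultaneously models subword segmentation and composes word embeddings from subwords, using only the spelling of a word. It outperforms previous subword-level models on word similarity and POS tagging.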
Findings
PBoS produces meaningful subword segmentations and subword rankings without any explicit morphological knowledge.
PBoS outperforms previous subword-level models on word similarity.
PBoS outperforms previous subword-level models on POS tagging.
Abstract
We look into the task of generalizing word embeddings: given a set of pre-trained word vectors over a finite vocabulary, the goal is to predict embedding vectors for out-of-vocabulary words, without extra contextual information. We rely solely on the spellings of words and propose a model, along with an efficient algorithm, that simultaneously models subword segmentation and computes subword-based compositional word embedding. We call the model probabilistic bag-of-subwords (PBoS), as it applies bag-of-subwords for all possible segmentations based on their likelihood. Inspections and affix prediction experiment show that PBoS is able to produce meaningful subword segmentations and subword rankings without any source of explicit morphological knowledge. Word similarity and POS tagging experiments show clear advantages of PBoS over previous subword-level models in the…
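The idea of applying bag-of-subwords to all possible segmentations, weighted by likelihood, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a segmentation's likelihood is the product of independent per-subword probabilities, that segmentations containing an unknown subword are skipped, and that the result is normalized by the total likelihood. It also enumerates segmentations by brute force instead of using the efficient algorithm the abstract mentions. The names `segmentations`, `likelihood_weighted_embedding`, `subword_prob`, and `subword_vec`, as well as the toy numbers, are hypothetical.

```python
from itertools import combinations


def segmentations(word):
    """Yield every split of `word` into contiguous, non-empty pieces."""
    n = len(word)
    for k in range(n):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]


def likelihood_weighted_embedding(word, subword_prob, subword_vec, dim):
    """Average bag-of-subwords vectors over segmentations, weighted by likelihood.

    Assumption (not from the paper summary): a segmentation's likelihood is
    the product of its subword probabilities, and segmentations that use a
    subword missing from `subword_prob` are ignored.
    """
    total = 0.0
    acc = [0.0] * dim
    for seg in segmentations(word):
        if any(s not in subword_prob for s in seg):
            continue
        p = 1.0
        for s in seg:
            p *= subword_prob[s]
        total += p
        for s in seg:  # bag-of-subwords: sum the subword vectors
            vec = subword_vec[s]
            for i in range(dim):
                acc[i] += p * vec[i]
    if total == 0.0:
        return None  # no segmentation is covered by the subword vocabulary
    return [x / total for x in acc]


# Toy, made-up subword statistics and vectors.
subword_prob = {"cats": 0.25, "cat": 0.5, "s": 0.5}
subword_vec = {"cats": [1.0, 0.0], "cat": [0.0, 1.0], "s": [0.0, 1.0]}

print(likelihood_weighted_embedding("cats", subword_prob, subword_vec, dim=2))
```

With these toy values only two segmentations are covered, `cats` and `cat` + `s`, each with likelihood 0.25, so both contribute equally to the result. Brute-force enumeration grows as 2^(n-1) in the word length n, which is why an efficient algorithm matters in practice.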
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
