BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

Benjamin Heinzerling; Michael Strube

arXiv:1710.02187·cs.CL·October 9, 2017·128 cites

TL;DR

BPEmb is a collection of pre-trained subword embeddings for 275 languages, based on Byte-Pair Encoding (BPE). It performs competitively on fine-grained entity typing while requiring far fewer resources than alternative subword approaches and no language-specific tokenization.

Contribution

It provides a large multilingual collection of subword embeddings based on BPE, enabling effective NLP applications without the need for language-specific tokenization.

Findings

01

BPEmb performs competitively on fine-grained entity typing, the task used as the evaluation testbed.

02

It outperforms some alternative subword methods for certain languages.

03

It requires vastly fewer resources than those alternatives and no tokenization.

Abstract

We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages better than alternative subword approaches, while requiring vastly fewer resources and no tokenization. BPEmb is available at https://github.com/bheinzerling/bpemb
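
To make the Byte-Pair Encoding idea in the abstract concrete, here is a minimal sketch of how BPE merge operations can be learned and applied. It is a toy illustration, not the BPEmb training pipeline: the function names (`merge_pair`, `learn_bpe`, `segment`) and the tiny word list are hypothetical, and it assumes the input is already split into words, which a real tokenization-free setup would avoid by operating on raw text.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split `word` into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
    print(merges)
    print(segment("lowest", merges))
```

The number of merges controls the vocabulary size: few merges leave words split into short, character-like units, while many merges yield longer units that approach whole words. Each resulting subword unit is what would then receive its own embedding vector.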

Code & Models

Repositories

bheinzerling/bpemb
Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification