BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
Benjamin Heinzerling, Michael Strube

TL;DR
BPEmb introduces a resource-efficient, tokenization-free set of pre-trained subword embeddings for 275 languages, demonstrating competitive performance in entity typing tasks without language-specific tokenization.
Contribution
It provides a large multilingual collection of subword embeddings based on BPE, enabling effective NLP applications without the need for language-specific tokenization.
Findings
BPEmb performs competitively in entity typing tasks.
It outperforms some alternative subword methods for certain languages.
It requires vastly fewer resources than those alternatives and no tokenization.
Abstract
We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages better than alternative subword approaches, while requiring vastly fewer resources and no tokenization. BPEmb is available at https://github.com/bheinzerling/bpemb
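To make the "based on Byte-Pair Encoding" and "no tokenization" points concrete, here is a minimal toy sketch of BPE in Python. It is not the BPEmb implementation: the function names (`learn_bpe`, `merge_pair`, `encode`), the tie-breaking rule, and the use of "▁" as a stand-in for spaces are assumptions made for illustration. The idea is that raw text is treated as one symbol sequence, and the most frequent adjacent symbol pair is merged repeatedly, so no language-specific word tokenizer is needed.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(text, num_merges):
    """Learn up to `num_merges` merge operations from raw, untokenized text."""
    # Assumption: spaces become a visible marker so they are ordinary symbols.
    symbols = list(text.replace(" ", "▁"))
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties broken by the pair itself for determinism.
        best, count = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break  # nothing left that occurs more than once
        merges.append(best)
        symbols = merge_pair(symbols, best)
    return merges


def encode(text, merges):
    """Segment new text by replaying the learned merges in order."""
    symbols = list(text.replace(" ", "▁"))
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


if __name__ == "__main__":
    merges = learn_bpe("low lower lowest", num_merges=3)
    print(merges)
    print(encode("lowest low", merges))
```

In a pipeline of this kind, each resulting subword unit would then be looked up in an embedding table. The number of merge operations controls the trade-off between short, character-like units and longer, word-like units.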
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
