TL;DR
SentencePiece is a language-independent subword tokenizer and detokenizer that can be trained directly from raw text. This enables end-to-end neural text processing, with English-Japanese translation accuracy comparable to that of traditional pre-tokenized pipelines.
Contribution
It provides an end-to-end subword segmentation tool that works directly on raw sentences, unlike previous tools, which require input that has already been tokenized into words.
Findings
Achieves comparable accuracy to traditional methods in English-Japanese NMT
Supports training directly from raw sentences, simplifying preprocessing
Available as open-source software (C++ and Python) under the Apache 2 license
Abstract
This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.
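The idea of segmenting raw sentences and still being able to detokenize them can be illustrated with a toy sketch. It is not the SentencePiece library or its training algorithm: the hand-written vocabulary and the greedy longest-match rule are assumptions made for illustration, and the function names `escape`, `segment`, and `detokenize` are hypothetical. The one detail taken from SentencePiece itself is the use of the meta symbol "▁" (U+2581) to represent whitespace, which is what makes the round trip lossless without any language-specific rules.

```python
# Toy illustration only: NOT the SentencePiece library, and NOT how it learns
# its vocabulary. The vocabularies below are hand-written assumptions.

WS = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace


def escape(text: str) -> str:
    """Turn spaces into a visible symbol so they are ordinary characters."""
    return text.replace(" ", WS)


def segment(text: str, vocab: set) -> list:
    """Greedy longest-match segmentation over the escaped raw string.

    No pre-tokenization into words is performed: the whole sentence,
    whitespace included, is treated as one character sequence.
    """
    s = escape(text)
    pieces = []
    i = 0
    while i < len(s):
        # Try the longest candidate first; fall back to a single character
        # so that unknown text can always be segmented.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab or j == i + 1:
                pieces.append(s[i:j])
                i = j
                break
    return pieces


def detokenize(pieces: list) -> str:
    """Concatenate pieces and restore spaces; no language-specific rules."""
    return "".join(pieces).replace(WS, " ")


# English: whitespace is carried inside the pieces themselves.
en_vocab = {"Hello", WS + "wor", "ld", "."}
en_pieces = segment("Hello world.", en_vocab)

# Japanese: no spaces in the input, and the same code path applies.
ja_vocab = {"こんにちは", "世界"}
ja_pieces = segment("こんにちは世界", ja_vocab)

print(en_pieces, detokenize(en_pieces))
print(ja_pieces, detokenize(ja_pieces))
```

Because whitespace is kept as a regular symbol, `detokenize(segment(text, vocab))` returns the original string for any vocabulary in this sketch, provided the input does not already contain "▁". The real tool additionally normalizes text and learns its vocabulary from data, both of which are omitted here.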
Code & Models
- 🤗 airesearch/wangchanberta-base-att-spm-uncased (model) · 90k dl · ♡ 49
- 🤗 casehold/custom-legalbert (model) · 11k dl · ♡ 17
- 🤗 sqllama/sqllama-V0 (model) · ♡ 4
- 🤗 chime-dasr/nemo_baseline_models (model) · 48 dl · ♡ 3
- 🤗 ancatmara/historical-irish-tokenizer-sentencepiece (model)
- 🤗 poomiiz/moon-thai-bert-emotion (model) · 3 dl
- 🤗 Sandipan1976/legalbert-legalops (model) · ♡ 1
Taxonomy
Methods: Byte Pair Encoding · SentencePiece
