Multilingual Language Processing From Bytes

Dan Gillick; Cliff Brunk; Oriol Vinyals; Amarnag Subramanya

arXiv:1512.00103 · cs.CL · April 5, 2016 · 27 cites

TL;DR

This paper introduces Byte-to-Span, an LSTM-based model that processes raw Unicode bytes to perform multilingual span annotation tasks like POS tagging and NER without relying on language-specific preprocessing or external data.

Contribution

The model operates directly on bytes, enabling a compact, language-agnostic approach that achieves competitive results without traditional NLP pipelines.

Findings

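01 Matches or exceeds state-of-the-art results in Part-of-Speech tagging and Named Entity Recognition among systems trained only on the provided datasets.

02 Handles multiple languages with a single compact model, thanks to the small byte-level vocabulary.

03 Requires no tokenization, other standard pipeline components, or external data sources.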

Abstract

We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, but produce results similar to or better than the state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that use only the provided training datasets (no external data sources). Our models are learning "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion on raw text.
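
To make the input and output format in the abstract concrete, here is a minimal Python sketch of the data representation only, not of the LSTM itself. It assumes UTF-8 encoding, byte-based span offsets, and illustrative token spellings (`S<start>`, `L<length>`, `STOP`); the helper names are hypothetical and chosen for this example.

```python
from typing import List, Tuple

# A span is (start byte offset, length in bytes, label).
Span = Tuple[int, int, str]


def text_to_bytes(text: str) -> List[int]:
    """Encode text as a sequence of UTF-8 byte values (0-255), the model input."""
    return list(text.encode("utf-8"))


def char_span_to_byte_span(text: str, char_start: int, char_end: int, label: str) -> Span:
    """Convert a character-offset span [char_start, char_end) to byte offsets."""
    start = len(text[:char_start].encode("utf-8"))
    length = len(text[char_start:char_end].encode("utf-8"))
    return (start, length, label)


def spans_to_targets(spans: List[Span]) -> List[str]:
    """Flatten spans into one output sequence of separate start, length, and
    label tokens. The token spellings and the final STOP marker are assumptions."""
    targets: List[str] = []
    for start, length, label in sorted(spans):
        targets += [f"S{start}", f"L{length}", label]
    targets.append("STOP")
    return targets


text = "Óscar Romero was born in El Salvador."
inputs = text_to_bytes(text)
spans = [
    char_span_to_byte_span(text, 0, 12, "PER"),   # "Óscar Romero"
    char_span_to_byte_span(text, 25, 36, "LOC"),  # "El Salvador"
]
targets = spans_to_targets(spans)
print(len(inputs), targets)
```

Because "Ó" occupies two bytes in UTF-8, the person span covers 13 bytes for 12 characters, and the location span starts at byte 26 rather than character 25. Byte-level offsets like these are what keep the vocabulary small and independent of any one language.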

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis