Unicode at Gigabytes per Second

Daniel Lemire

arXiv:2111.08692·cs.PL·May 23, 2023


TL;DR

This paper presents a method for validating and transcoding Unicode text (UTF-8 and UTF-16) at gigabytes per second on current x64 and ARM systems, running up to ten times faster than the ICU library on non-ASCII strings.

Contribution

The paper introduces an open-source library that validates and transcodes Unicode text without sacrificing safety. It can be ten times faster than the popular ICU library on non-ASCII strings, and faster still on ASCII strings.

Findings

01

Validates and transcodes Unicode text at gigabytes per second

02

Can be ten times faster than the ICU library on non-ASCII strings, and even faster on ASCII strings

03

Works efficiently on x64 and ARM architectures

Abstract

We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating systems such as Windows. Software systems frequently have to convert text from one Unicode format to the other. While recent disks have bandwidths of 5 GiB/s or more, conventional approaches transcode non-ASCII text at a fraction of a gigabyte per second. We show that we can validate and transcode Unicode text at gigabytes per second on current systems (x64 and ARM) without sacrificing safety. Our open-source library can be ten times faster than the popular ICU library on non-ASCII strings and even faster on ASCII strings.
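
To make "validate and transcode" concrete, here is a minimal scalar sketch of a UTF-8 to UTF-16 conversion that rejects invalid input. It is not the paper's method and makes no attempt at gigabyte-per-second speed; it only illustrates the checks a safe transcoder has to perform. The function name `utf8_to_utf16` is hypothetical and chosen for this illustration.

```python
def utf8_to_utf16(data: bytes) -> list[int]:
    """Validate UTF-8 bytes and return the equivalent UTF-16 code units.

    Illustrative byte-at-a-time reference, not the paper's algorithm.
    Raises ValueError on any invalid UTF-8 sequence.
    """
    out: list[int] = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # Classify the leading byte: payload bits, sequence length,
        # and the smallest code point that legitimately needs that length.
        if b0 < 0x80:
            cp, length, minimum = b0, 1, 0x0
        elif 0xC2 <= b0 <= 0xDF:
            cp, length, minimum = b0 & 0x1F, 2, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, length, minimum = b0 & 0x0F, 3, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, length, minimum = b0 & 0x07, 4, 0x10000
        else:
            raise ValueError(f"invalid leading byte at offset {i}")
        if i + length > n:
            raise ValueError(f"truncated sequence at offset {i}")
        # Each continuation byte must look like 10xxxxxx.
        for j in range(1, length):
            b = data[i + j]
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + j}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate code point at offset {i}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        if cp < 0x10000:
            out.append(cp)  # one 16-bit code unit
        else:
            cp -= 0x10000  # split into a surrogate pair
            out.append(0xD800 | (cp >> 10))
            out.append(0xDC00 | (cp & 0x3FF))
        i += length
    return out
```

For example, `utf8_to_utf16("é".encode("utf-8"))` yields the single code unit `0xE9`, while a character outside the Basic Multilingual Plane produces two code units (a surrogate pair). A loop of this shape handles one character per iteration with several branches, which is the kind of conventional approach the abstract contrasts with its gigabyte-per-second results.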
