# Making compression algorithms for Unicode text

**Authors:** Adam Gleave, Christian Steinruecken

arXiv: 1701.04047 · 2017-01-17

## TL;DR

This paper introduces a technique for adapting byte-based compression algorithms to operate directly on Unicode characters. The modified compressors substantially improve compression of UTF-8 encoded text while performing similarly to the unmodified originals on ASCII and binary files.

## Contribution

The authors develop a technique for modifying existing byte-based compressors to operate directly on Unicode characters, implement LZW and PPM variants that apply it, and show improved compression effectiveness on a UTF-8 corpus.

## Key findings

- The authors' PPM variant outperforms the state-of-the-art PPMII compressor on a UTF-8 corpus
- The modified LZW and PPM variants perform similarly to the original unmodified compressors on ASCII and binary files
- The method substantially improves compression effectiveness on UTF-8 text

## Abstract

The majority of online content is written in languages other than English, and is most commonly encoded in UTF-8, the world's dominant Unicode character encoding. Traditional compression algorithms typically operate on individual bytes. While this approach works well for the single-byte ASCII encoding, it works poorly for UTF-8, where characters often span multiple bytes. Previous research has focused on developing Unicode compressors from scratch, which often failed to outperform established algorithms such as bzip2. We develop a technique to modify byte-based compressors to operate directly on Unicode characters, and implement variants of LZW and PPM that apply this technique. We find that our method substantially improves compression effectiveness on a UTF-8 corpus, with our PPM variant outperforming the state-of-the-art PPMII compressor. On ASCII and binary files, our variants perform similarly to the original unmodified compressors.
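To make the byte-versus-character distinction concrete, the following is a minimal sketch, not the authors' implementation. It runs a textbook LZW coder over the same string twice: once as UTF-8 bytes and once as Unicode code points. The function names `lzw_encode` and `lzw_decode` are hypothetical, and seeding the code-point dictionary with the symbols that occur in the input is a simplifying assumption made here for illustration; the summary above does not describe how the paper handles the large Unicode alphabet or invalid byte sequences.

```python
def lzw_encode(symbols, alphabet):
    """Textbook LZW over any sequence of hashable symbols; returns integer codes."""
    # Seed the dictionary with one entry per alphabet symbol.
    table = {(s,): i for i, s in enumerate(alphabet)}
    codes = []
    current = ()
    for s in symbols:
        candidate = current + (s,)
        if candidate in table:
            current = candidate  # keep extending the current match
        else:
            codes.append(table[current])    # emit the longest known match
            table[candidate] = len(table)   # learn the new phrase
            current = (s,)
    if current:
        codes.append(table[current])
    return codes


def lzw_decode(codes, alphabet):
    """Inverse of lzw_encode, given the same seed alphabet."""
    table = {i: (s,) for i, s in enumerate(alphabet)}
    if not codes:
        return []
    prev = table[codes[0]]
    out = list(prev)
    for code in codes[1:]:
        # A code not yet in the table can only be "prev + first symbol of prev".
        entry = table[code] if code in table else prev + (prev[0],)
        out.extend(entry)
        table[len(table)] = prev + (entry[0],)
        prev = entry
    return out


text = "Привет, мир! Привет, мир!"

# Byte view: every Cyrillic letter becomes two symbols.
byte_symbols = list(text.encode("utf-8"))
byte_alphabet = list(range(256))
byte_codes = lzw_encode(byte_symbols, byte_alphabet)

# Character view: one symbol per code point.
# ASSUMPTION: the alphabet is taken from the input itself, for illustration only.
char_symbols = [ord(c) for c in text]
char_alphabet = sorted(set(char_symbols))
char_codes = lzw_encode(char_symbols, char_alphabet)

print(len(byte_symbols), "byte symbols ->", len(byte_codes), "codes")
print(len(char_symbols), "character symbols ->", len(char_codes), "codes")
```

The character view hands the coder a shorter symbol sequence whose units line up with the text's actual structure, which is the intuition behind operating on Unicode characters rather than bytes. The code counts alone are not compressed sizes, since the width of each code depends on the dictionary size, so this sketch says nothing about the compression ratios reported in the paper.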

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1701.04047/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1701.04047/full.md

## References

20 references — full list in the complete paper: https://tomesphere.com/paper/1701.04047/full.md

---
Source: https://tomesphere.com/paper/1701.04047