Duncode Characters Shorter

Changshang Xue

arXiv:2307.05414·cs.CL·July 12, 2023

TL;DR

This paper introduces Duncode, a new Unicode encoding method that achieves higher space efficiency than UTF-8, and presents a comprehensive benchmark for evaluating text encoders across 179 languages.

Contribution

The paper proposes Duncode, an encoding scheme that covers the entire Unicode set with higher space efficiency than UTF-8, and develops a multilingual encoder benchmark covering 179 languages.

Findings

01. Duncode outperforms UTF-8 in space efficiency.

02. A benchmark with 179 languages evaluates encoder performance.

03. Duncode can compress multiple characters into fewer bytes.
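
To make the space-efficiency comparison concrete, the sketch below measures how many bytes the same text takes under a universal encoder (UTF-8, UTF-16) and under a local encoder (GB-2312), using only Python's built-in codecs. It does not implement Duncode. The helper name `compare` and the sample strings are illustrative assumptions and are not taken from the paper or its benchmark.

```python
# Byte cost of the same text under universal and local encoders.
# NOTE: this illustrates the baseline encoders only; it is not Duncode.
# `compare` and the sample strings below are hypothetical, chosen for illustration.

def compare(text, encodings=("utf-8", "utf-16-le", "gb2312")):
    """Return {encoding: byte length}, or None where the encoder cannot represent the text."""
    sizes = {}
    for enc in encodings:
        try:
            sizes[enc] = len(text.encode(enc))
        except UnicodeEncodeError:
            # A local encoder only covers its own character repertoire.
            sizes[enc] = None
    return sizes


if __name__ == "__main__":
    for label, sample in [("English", "hello"), ("Chinese", "中文"), ("Hindi", "हिन्दी")]:
        print(label, compare(sample))
```

For the Chinese sample, GB-2312 needs 4 bytes where UTF-8 needs 6, while the Hindi sample cannot be encoded in GB-2312 at all. This is the trade-off between local and universal encoders that the paper sets out to close.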

Abstract

This paper investigates the use of various encoders for text transformation, that is, converting characters into bytes. It discusses local encoders such as ASCII and GB-2312, which encode a specific set of characters into short byte sequences, and universal encoders such as UTF-8 and UTF-16, which can encode the complete Unicode set at the cost of greater space requirements and are gaining widespread acceptance. Other encoders, including SCSU, BOCU-1, and binary encoders, lack self-synchronizing capabilities. Duncode is introduced as an innovative encoding method that aims to encode the entire Unicode character set with high space efficiency, akin to local encoders. It has the potential to compress multiple characters of a string into a single Duncode unit using fewer bytes. Despite offering less self-synchronizing identification information, Duncode surpasses UTF-8 in terms of space efficiency. The application is…

Taxonomy

Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · Advanced Data Storage Technologies