
TL;DR
This paper introduces Duncode, a new Unicode encoding method that achieves higher space efficiency than UTF-8, and presents a comprehensive benchmark for evaluating text encoders across 179 languages.
Contribution
The paper proposes Duncode, an encoding scheme that covers the entire Unicode character set with higher space efficiency than UTF-8, and develops a multilingual encoder benchmark covering 179 languages.
Findings
Duncode outperforms UTF-8 in space efficiency.
A benchmark with 179 languages evaluates encoder performance.
Duncode can compress multiple characters of a string into a single Duncode unit that uses fewer bytes.
Abstract
This paper investigates the use of various encoders in text transformation, that is, converting characters into bytes. It discusses local encoders such as ASCII and GB-2312, which encode specific characters into shorter byte sequences, and universal encoders like UTF-8 and UTF-16, which can encode the complete Unicode set at the cost of greater space requirements and are gaining widespread acceptance. Other encoders, including SCSU, BOCU-1, and binary encoders, lack self-synchronizing capabilities. Duncode is introduced as an innovative encoding method that aims to encode the entire Unicode character set with high space efficiency, akin to local encoders. It has the potential to compress multiple characters of a string into a Duncode unit using fewer bytes. Despite offering less self-synchronizing identification information, Duncode surpasses UTF-8 in terms of space efficiency. The application is…
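The trade-off described in the abstract can be illustrated with encoders that ship in Python's standard library. The sketch below is not Duncode, whose algorithm is not specified in this summary. It only shows two things the abstract relies on: a local encoder (GB2312) spends fewer bytes on Chinese text than UTF-8, and UTF-8 is self-synchronizing because continuation bytes always match the bit pattern 10xxxxxx. The helper names `encoded_sizes` and `resync_utf8` are illustrative assumptions, not names from the paper.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of `text` under a local encoder and two universal encoders."""
    return {
        "gb2312": len(text.encode("gb2312")),       # local encoder for simplified Chinese
        "utf-8": len(text.encode("utf-8")),         # universal, 1 to 4 bytes per code point
        "utf-16-le": len(text.encode("utf-16-le")), # universal, 2 or 4 bytes per code point
    }


def is_utf8_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80


def resync_utf8(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after `pos`."""
    while pos < len(data) and is_utf8_continuation(data[pos]):
        pos += 1
    return pos


sample = "中文"  # two CJK characters, both covered by GB2312
print(encoded_sizes(sample))  # → {'gb2312': 4, 'utf-8': 6, 'utf-16-le': 4}

data = sample.encode("utf-8")
start = resync_utf8(data, 1)  # start reading in the middle of the first character
print(start, data[start:].decode("utf-8"))  # → 3 文
```

Here UTF-8 needs 6 bytes where GB2312 needs 4, which is the kind of gap a space-efficient universal encoder tries to close. The second part shows what self-synchronization provides: a reader dropped at an arbitrary byte offset can skip forward to the next character boundary without decoding from the start of the stream.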
Taxonomy
Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · Advanced Data Storage Technologies
