An Enhanced Text Compression Approach Using Transformer-based Language Models
Chowdhury Mofizur Rahman, Mahbub E Sobhani, Anika Tasnim Rodela, and Swakkhar Shatabda

TL;DR
This paper introduces RejuvenateForme, a transformer-based text decompression method that combines innovative pre-processing and lossless compression to achieve superior compression ratios and BLEU scores on multiple corpora.
Contribution
It presents a novel transformer-based approach with a new pre-processing technique and lossless compression, achieving state-of-the-art results in text compression and decompression.
Findings
Achieves compression ratios of 12.57, 13.38, and 11.42 on three corpora.
Attains BLEU scores of 27.31, 25.78, and 50.45, outperforming previous models.
Pre-trained T5-Small outperforms prior state-of-the-art models.
Abstract
Text compression shrinks textual data while keeping crucial information, easing constraints on storage, bandwidth, and computational efficiency. The integration of lossless compression techniques with transformer-based text decompression has received negligible attention, despite the increasing volume of English text data in communication. The primary barrier to advancing text compression and restoration is optimizing transformer-based approaches with efficient pre-processing and integrating lossless compression algorithms, which remained unresolved in prior attempts. Here, we propose a transformer-based method named RejuvenateForme for text decompression, addressing prior issues by harnessing a new pre-processing technique and a lossless compression method. Our meticulous pre-processing technique incorporating the Lempel-Ziv-Welch algorithm achieves compression ratios of…
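The abstract names the Lempel-Ziv-Welch (LZW) algorithm as the lossless stage and reports results as compression ratios. The sketch below is a textbook byte-level LZW encoder and decoder, not the authors' pipeline: it omits their pre-processing step and the transformer entirely. The function names and the ratio definition (original bits divided by compressed bits, with every code stored at a fixed width) are assumptions made for illustration, and the paper may measure its ratio differently.

```python
def lzw_compress(text: str) -> list[int]:
    """Encode text as a list of LZW codes (textbook variant, unbounded table)."""
    # Seed the table with one entry per possible byte value.
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for value in text.encode("utf-8"):
        candidate = current + bytes([value])
        if candidate in table:
            # Keep extending the current match.
            current = candidate
        else:
            # Emit the longest known match and learn the new sequence.
            codes.append(table[current])
            table[candidate] = len(table)
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> str:
    """Invert lzw_compress by rebuilding the same table on the fly."""
    if not codes:
        return ""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    chunks = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        chunks.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(chunks).decode("utf-8")


def compression_ratio(text: str, codes: list[int]) -> float:
    """Original bits over compressed bits (assumed definition, fixed code width)."""
    if not codes:
        raise ValueError("cannot compute a ratio for empty input")
    # Assumption: every code is stored with the width of the largest code.
    bits_per_code = max(codes).bit_length()
    return (8 * len(text.encode("utf-8"))) / (bits_per_code * len(codes))


sample = "TOBEORNOTTOBEORTOBEORNOT"
sample_codes = lzw_compress(sample)
restored = lzw_decompress(sample_codes)
ratio = compression_ratio(sample, sample_codes)
```

On short inputs like `sample` the ratio stays close to 1, because the table has little repetition to exploit. The much higher ratios quoted in the abstract come from the authors' full method, which this sketch does not reproduce.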
