An Enhanced Text Compression Approach Using Transformer-based Language Models

Chowdhury Mofizur Rahman, Mahbub E Sobhani, Anika Tasnim Rodela, and Swakkhar Shatabda

arXiv:2412.15250·cs.CL·December 23, 2024

TL;DR

This paper introduces RejuvenateForme, a transformer-based text decompression method that pairs a new pre-processing technique with lossless Lempel-Ziv-Welch compression, reporting higher compression ratios and BLEU scores than previous models on three corpora.

Contribution

The paper combines a transformer-based decompression model with a new pre-processing technique and lossless compression, and reports state-of-the-art results in text compression and decompression.

Findings

01

Achieves compression ratios of 12.57, 13.38, and 11.42 on three corpora.

02

Attains BLEU scores of 27.31, 25.78, and 50.45, outperforming previous models.

03

Pre-trained T5-Small outperforms prior state-of-the-art models.

Abstract

Text compression shrinks textual data while preserving crucial information, easing constraints on storage, bandwidth, and computation. Despite the increasing volume of English text data in communication, the integration of lossless compression techniques with transformer-based text decompression has received negligible attention. The primary barrier to advancing text compression and restoration is optimizing transformer-based approaches with efficient pre-processing and integrating lossless compression algorithms, which remained unresolved in prior attempts. Here, we propose a transformer-based method named RejuvenateForme for text decompression, addressing prior issues by harnessing a new pre-processing technique and a lossless compression method. Our meticulous pre-processing technique incorporating the Lempel-Ziv-Welch algorithm achieves compression ratios of…
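To make the lossless stage concrete, here is a minimal sketch of textbook Lempel-Ziv-Welch coding with a round trip and a compression-ratio calculation. It is not the authors' pipeline: the paper's pre-processing step and transformer model are not reproduced. The function names, the byte-level starting dictionary, the fixed-width code assumption, and the ratio convention (original bits divided by compressed bits) are all illustrative assumptions.

```python
def lzw_compress(text: str) -> list[int]:
    """Encode text as a list of LZW dictionary codes (byte-level, textbook variant)."""
    data = text.encode("utf-8")
    # Start with one dictionary entry per possible byte value.
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            # Keep extending the match while it is still in the dictionary.
            current = candidate
        else:
            # Emit the longest known match and register the new sequence.
            codes.append(table[current])
            table[candidate] = len(table)
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> str:
    """Rebuild the original text from LZW codes."""
    if not codes:
        return ""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out).decode("utf-8")


def compression_ratio(text: str, codes: list[int]) -> float:
    """Original bits over compressed bits, assuming fixed-width codes (an assumption)."""
    if not codes:
        raise ValueError("cannot compute a ratio for empty input")
    # Width is sized to the largest emitted code, with a 9-bit minimum.
    code_bits = max(9, max(codes).bit_length())
    return (len(text.encode("utf-8")) * 8) / (len(codes) * code_bits)


if __name__ == "__main__":
    sample = "TOBEORNOTTOBEORTOBEORNOT"
    codes = lzw_compress(sample)
    print(len(codes), "codes, ratio", round(compression_ratio(sample, codes), 2))
    print(lzw_decompress(codes) == sample)
```

Because LZW alone is lossless, the round trip reproduces the input exactly. The ratios this sketch yields on raw text are not comparable to the figures reported above, which come from the paper's full pipeline.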
