GPT-2 as a Compression Preprocessor: Improving Gzip for Structured Text Domains
Anurag Kumar Ojha

TL;DR
This paper introduces a GPT-2-based preprocessing step that transforms structured, domain-specific text files such as logs and HTML into a more compressible form before gzip is applied.
Contribution
The paper presents a GPT-2-based preprocessing pipeline that improves gzip compression of structured, domain-specific files, addressing a limitation of traditional pattern-based compressors.
Findings
Improved compression of defense logs by 0.34%
Enhanced HTML file compression by 5.8%
Demonstrated effectiveness on real-world and synthetic data
Abstract
Large volumes of data are produced continuously, especially in domain-specific fields such as medical records and clinical files, defence logs, and HTML-based web traffic. Data of this volume and complexity must be compressed before it can be stored and transmitted efficiently. Data compression has received significant research attention, resulting in fast and efficient compression algorithms such as gzip. However, gzip works by exploiting repeated byte patterns, so it struggles with domain-specific formats such as JSON, XML, HTML, and log files, which are structured but often exhibit semantic repetition rather than syntactic repetition. In this article, we propose a GPT-based preprocessor for such domain-specific files. We propose a pipeline made up of GPT-2 taking…
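The limitation described in the abstract can be seen with a small experiment. The sketch below is not the paper's pipeline; it only compares how gzip handles exact byte-level repetition versus lines that mean roughly the same thing but differ in wording and identifiers. The log templates, the `gzip_ratio` helper, and the sample sizes are all hypothetical choices made for illustration.

```python
import gzip
import random


def gzip_ratio(text: str) -> float:
    """Return compressed size divided by original size (lower is better)."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)


# Syntactic repetition: the exact same bytes occur again and again.
repeated_log = "2024-01-01 INFO unit 7 status nominal\n" * 200

# Semantic repetition: every line says roughly the same thing, but the
# wording and the embedded identifiers differ (hypothetical templates).
rng = random.Random(0)  # fixed seed so the run is reproducible
templates = [
    "unit {uid} reported nominal status at t={t}\n",
    "t={t}: status of unit {uid} is OK\n",
    "[{t}] {uid} operating normally\n",
]
varied_log = "".join(
    rng.choice(templates).format(
        uid=f"{rng.getrandbits(32):08x}",
        t=rng.randrange(10**9),
    )
    for _ in range(200)
)

ratio_repeated = gzip_ratio(repeated_log)
ratio_varied = gzip_ratio(varied_log)
print(f"exact repetition:    {ratio_repeated:.3f}")
print(f"semantic repetition: {ratio_varied:.3f}")
```

The exactly repeated log should compress to a much smaller fraction of its original size than the varied one, which is the gap a preprocessing step would aim to narrow.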
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Logic, programming, and type systems
