GPT-2 as a Compression Preprocessor: Improving Gzip for Structured Text Domains

Anurag Kumar Ojha

arXiv:2508.14061 · cs.IR · August 21, 2025

Open Access

TL;DR

This paper introduces a GPT-2 based preprocessing step to enhance gzip compression efficiency for structured, domain-specific text files like logs and HTML by transforming data into a more compressible form.

Contribution

The paper presents a novel GPT-2 based preprocessing pipeline that improves gzip compression rates for structured domain-specific files, addressing limitations of traditional pattern-based compressors.

Findings

01 Improved compression of defense logs by 0.34%

02 Enhanced HTML file compression by 5.8%

03 Demonstrated effectiveness on real-world and synthetic data

Abstract

Large volumes of data are produced continuously, especially in domain-specific fields such as medical records and clinical files, defence logs, and HTML-based web traffic. Data of this volume and complexity needs to be compressed so that it can be stored and transmitted efficiently. Data compression has attracted significant research attention, resulting in fast and efficient compression algorithms such as gzip. However, gzip works by exploiting repeated binary patterns. Domain-specific formats such as JSON, XML, HTML, and log files are structured, but their repetition may be semantic rather than syntactic, which gzip finds difficult to compress. In this article, we propose a GPT-based preprocessor for such domain-specific files. We propose a pipeline made up of GPT-2 taking…
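The abstract's distinction between syntactic and semantic repetition can be illustrated with a minimal sketch using Python's standard `gzip` module. This is not the paper's pipeline and involves no GPT-2 step. The log lines, the word lists, and the helper `gzip_ratio` are hypothetical, chosen only to contrast byte-identical repetition with lines that share a meaning but vary in wording.

```python
import gzip
import random


def gzip_ratio(text: str) -> float:
    """Return compressed size divided by original size (lower is better)."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw, compresslevel=9)) / len(raw)


# Syntactic repetition: the same byte sequence repeated 200 times.
syntactic = "user 1234 failed to log in\n" * 200

# Semantic repetition: every line reports a failed login, but the wording
# and the numeric ID vary. These word lists are made up for illustration.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
subjects = ["user", "account", "operator", "client"]
outcomes = [
    "failed to log in",
    "was denied access",
    "could not authenticate",
    "had sign-in rejected",
]
semantic = "".join(
    f"{rng.choice(subjects)} {rng.randrange(10000)} {rng.choice(outcomes)}\n"
    for _ in range(200)
)

syntactic_ratio = gzip_ratio(syntactic)
semantic_ratio = gzip_ratio(semantic)

print(f"syntactic repetition ratio: {syntactic_ratio:.3f}")
print(f"semantic repetition ratio:  {semantic_ratio:.3f}")
```

Under these assumptions, the byte-identical input compresses to a much smaller fraction of its size than the varied input, even though both convey the same kind of event. This is the gap that a preprocessing step of the kind the abstract describes would aim to narrow.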

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Logic, programming, and type systems