LLMComp: A Language Modeling Paradigm for Error-Bounded Scientific Data Compression (Technical Report)

Guozhong Li; Muhannad Alhumaidi; Spiros Skiadopoulos; Panos Kalnis

arXiv:2510.23632·cs.LG·November 6, 2025

TL;DR

LLMComp introduces a novel error-bounded scientific data compression method using decoder-only large language models, achieving higher compression ratios by modeling data as sequences with locality preservation.

Contribution

The paper presents LLMComp, a new paradigm leveraging large language models for scientific data compression with error bounds, combining quantization, locality-aware tokenization, and autoregressive modeling.
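The first two stages named above, quantization and locality-aware ordering, can be illustrated with a minimal sketch. It assumes uniform scalar quantization with bin width twice the error bound and a bit-interleaved Morton (Z-order) key; the function names (`quantize`, `dequantize`, `morton_index`, `z_order_flatten`) are hypothetical, and the paper's actual quantizer and tokenizer may differ.

```python
import numpy as np

def quantize(field, error_bound):
    """Uniform quantization (assumed scheme): bins of width 2*eb keep |x - x_hat| <= eb."""
    return np.round(field / (2.0 * error_bound)).astype(np.int64)

def dequantize(tokens, error_bound):
    """Map integer tokens back to bin centers."""
    return tokens * (2.0 * error_bound)

def morton_index(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into a single Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

def z_order_flatten(tokens):
    """Return the tokens of a 3D array in Z-order, plus the visiting order."""
    nx, ny, nz = tokens.shape
    coords = [(x, y, z) for x in range(nx) for y in range(ny) for z in range(nz)]
    coords.sort(key=lambda c: morton_index(*c))
    return np.array([tokens[c] for c in coords]), coords
```

Sorting by the Morton key keeps cells that are close in 3D close in the 1D sequence more often than a plain row-major scan does, which is the locality property the summary refers to.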

Findings

01. Outperforms state-of-the-art compressors by up to 30% in compression ratio.

02. Effectively models complex scientific data sequences with LLMs.

03. Maintains strict error bounds while improving compression efficiency.

Abstract

The rapid growth of high-resolution scientific simulations and observation systems is generating massive spatiotemporal datasets, making efficient, error-bounded compression increasingly important. Meanwhile, decoder-only large language models (LLMs) have demonstrated remarkable capabilities in modeling complex sequential data. In this paper, we propose LLMCOMP, a novel lossy compression paradigm that leverages decoder-only LLMs to model scientific data. LLMCOMP first quantizes 3D fields into discrete tokens, arranges them via Z-order curves to preserve locality, and applies coverage-guided sampling to enhance training efficiency. An autoregressive transformer is then trained with spatial-temporal embeddings to model token transitions. During compression, the model performs top-k prediction, storing only rank indices and fallback corrections to ensure strict error bounds.…
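The top-k rank coding with fallback described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `predict_topk` is a hypothetical stand-in for the trained transformer (it simply guesses tokens near the previous one), and using the value `k` as an escape rank for missed predictions is an assumed encoding of the "fallback corrections".

```python
def predict_topk(context, k):
    """Hypothetical stand-in for the model: propose k tokens near the previous one."""
    prev = context[-1] if context else 0
    candidates = [prev]
    step = 1
    while len(candidates) < k:
        candidates.append(prev + step)
        if len(candidates) < k:
            candidates.append(prev - step)
        step += 1
    return candidates

def encode(tokens, k=4):
    """Store the rank of each token in the top-k list, or an escape plus the raw token."""
    ranks, fallbacks = [], []
    for i, tok in enumerate(tokens):
        candidates = predict_topk(tokens[:i], k)
        if tok in candidates:
            ranks.append(candidates.index(tok))
        else:
            ranks.append(k)        # assumed escape symbol: prediction missed
            fallbacks.append(tok)  # exact token kept so decoding stays lossless
    return ranks, fallbacks

def decode(ranks, fallbacks, k=4):
    """Replay the same predictor to turn ranks back into tokens."""
    tokens, fb = [], iter(fallbacks)
    for r in ranks:
        if r == k:
            tokens.append(next(fb))
        else:
            tokens.append(predict_topk(tokens, k)[r])
    return tokens
```

Because the token stream is recovered exactly, the only loss is the one introduced by quantization, so the error bound set at that stage carries through. When the predictor is good, most ranks are small and repetitive, which is what makes them cheap to entropy-code.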
