LLMComp: A Language Modeling Paradigm for Error-Bounded Scientific Data Compression (Technical Report)
Guozhong Li, Muhannad Alhumaidi, Spiros Skiadopoulos, Panos Kalnis

TL;DR
LLMComp introduces an error-bounded scientific data compression method built on decoder-only large language models. It reports up to 30% higher compression ratios than state-of-the-art compressors by modeling quantized data as locality-preserving token sequences.
Contribution
The paper presents LLMComp, a new paradigm leveraging large language models for scientific data compression with error bounds, combining quantization, locality-aware tokenization, and autoregressive modeling.
Findings
Outperforms state-of-the-art compressors by up to 30% in compression ratio.
Effectively models complex scientific data sequences with LLMs.
Maintains strict error bounds while improving compression efficiency.
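The strict error bound usually originates in the quantization step. The following is a minimal sketch, assuming uniform scalar quantization with bin width 2·eps; this summary does not specify the paper's exact quantizer, and the names `quantize`, `dequantize`, and `eps` are illustrative only.

```python
import numpy as np

def quantize(field, eps):
    """Map each value to an integer token; bins have width 2*eps (assumed scheme)."""
    return np.round(field / (2.0 * eps)).astype(np.int64)

def dequantize(tokens, eps):
    """Reconstruct each value as the center of its bin."""
    return tokens.astype(np.float64) * (2.0 * eps)

rng = np.random.default_rng(0)
field = rng.normal(size=(4, 4, 4))   # synthetic 3D field, not real simulation data
eps = 1e-2                           # absolute error bound

tokens = quantize(field, eps)
recon = dequantize(tokens, eps)

# Each value lies within half a bin width (eps) of its bin center,
# up to floating-point rounding.
max_err = float(np.max(np.abs(field - recon)))
print(max_err <= eps * (1 + 1e-9))
```

Because the tokens are integers, any lossless coding of them afterwards leaves the bound intact; only the quantization step loses information.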
Abstract
The rapid growth of high-resolution scientific simulations and observation systems is generating massive spatiotemporal datasets, making efficient, error-bounded compression increasingly important. Meanwhile, decoder-only large language models (LLMs) have demonstrated remarkable capabilities in modeling complex sequential data. In this paper, we propose LLMCOMP, a novel lossy compression paradigm that leverages decoder-only LLMs to model scientific data. LLMCOMP first quantizes 3D fields into discrete tokens, arranges them via Z-order curves to preserve locality, and applies coverage-guided sampling to enhance training efficiency. An autoregressive transformer is then trained with spatial-temporal embeddings to model token transitions. During compression, the model performs top-k prediction, storing only rank indices and fallback corrections to ensure strict error bounds.…
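The Z-order arrangement and the rank-plus-fallback encoding described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `toy_top_k` is a hypothetical stand-in for the trained transformer, and using the value `k` as an escape symbol for fallbacks is an assumed convention.

```python
def morton3(x, y, z, bits=10):
    """Interleave the bits of (x, y, z) into a single Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def z_order(shape, bits=10):
    """Return all grid coordinates sorted along the Z-order curve."""
    coords = [(x, y, z)
              for x in range(shape[0])
              for y in range(shape[1])
              for z in range(shape[2])]
    return sorted(coords, key=lambda c: morton3(*c, bits=bits))

def toy_top_k(context, k):
    """Hypothetical predictor: guesses tokens near the previous one."""
    prev = context[-1] if context else 0
    offsets = [0, 1, -1, 2, -2, 3, -3]
    return [prev + o for o in offsets[:k]]

def encode(tokens, k=3):
    """Store the rank of each token among the top-k guesses, or an escape."""
    ranks, fallbacks = [], []
    for i, tok in enumerate(tokens):
        cands = toy_top_k(tokens[:i], k)
        if tok in cands:
            ranks.append(cands.index(tok))
        else:
            ranks.append(k)          # escape: token is stored verbatim
            fallbacks.append(tok)
    return ranks, fallbacks

def decode(ranks, fallbacks, k=3):
    """Replay the same predictor to recover the exact token sequence."""
    out, fb = [], iter(fallbacks)
    for r in ranks:
        out.append(next(fb) if r == k else toy_top_k(out, k)[r])
    return out

tokens = [5, 5, 6, 6, 5, 9, 9, 10]   # quantized tokens, already in Z-order
ranks, fallbacks = encode(tokens)
print(ranks, fallbacks)              # [3, 0, 1, 0, 2, 3, 0, 1] [5, 9]
print(decode(ranks, fallbacks) == tokens)
```

When the predictor is accurate, most ranks are small and repetitive, so a downstream entropy coder can store them compactly. The fallbacks guarantee that the token stream is recovered exactly, so the quantization error bound still holds.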
