Investigating the Fundamental Limit: A Feasibility Study of Hybrid-Neural Archival
Marcus Armstrong, ZiWei Qiu, Huy Q. Vo, Arjun Mukherjee

TL;DR
This study explores the potential of Large Language Models for lossless data compression, introducing a novel architecture and addressing hardware non-determinism to measure neural compression capabilities.
Contribution
The paper presents Hybrid-LLM, a proof-of-concept system, and introduces a logit quantization protocol to measure neural compression rates, addressing deployment barriers.
Findings
LLMs achieve 0.39 bits per character (BPC) on memorized data
LLMs achieve 0.75 BPC on unseen data
Inference latency is significantly higher than classical methods
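For readers unfamiliar with the metric, BPC can be sketched as the ideal code length under the model's predictions divided by the number of characters covered. The function name and the example probabilities below are hypothetical, chosen only so the arithmetic is easy to check; they are not the paper's data or code.

```python
import math


def bits_per_character(token_probs, num_chars):
    """Ideal code length in bits, divided by the number of characters encoded.

    token_probs: probability the model assigned to each token that actually
                 occurred (hypothetical values in the example below).
    num_chars:   number of characters those tokens span.
    """
    # An ideal entropy coder spends -log2(p) bits on a symbol of probability p.
    total_bits = -sum(math.log2(p) for p in token_probs)
    return total_bits / num_chars


# Hypothetical example: four tokens spanning 16 characters.
probs = [0.5, 0.25, 0.5, 0.25]          # costs 1 + 2 + 1 + 2 = 6 bits
bpc = bits_per_character(probs, 16)      # 6 / 16 = 0.375
print(bpc)
```

Against an 8-bit-per-character raw encoding, a rate near 0.39 BPC would correspond to roughly a 20x size reduction, before any container or model overhead.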
Abstract
Large Language Models (LLMs) possess a theoretical capability to model information density far beyond the limits of classical statistical methods (e.g., Lempel-Ziv). However, utilizing this capability for lossless compression involves navigating severe system constraints, including non-deterministic hardware and prohibitive computational costs. In this work, we present an exploratory study into the feasibility of LLM-based archival systems. We introduce Hybrid-LLM, a proof-of-concept architecture designed to investigate the "entropic capacity" of foundation models in a storage context. We identify a critical barrier to deployment: the "GPU Butterfly Effect," where microscopic hardware non-determinism precludes data recovery. We resolve this via a novel logit quantization protocol, enabling the rigorous measurement of neural compression rates on real-world data. Our…
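The abstract does not spell out the quantization protocol, so the following is only a minimal sketch of the general idea, under the assumption that logits are snapped to a coarse grid before they drive the entropy coder. The function name, the step size, and the logit values are all hypothetical.

```python
def quantize_logits(logits, step=0.125):
    """Snap each logit to the nearest multiple of `step`.

    Returns integer grid indices, so the encoder and decoder can compare
    and use them exactly. `step` is an assumed value, not the paper's.
    """
    return [round(x / step) for x in logits]


# Hypothetical logits for the same context on two runs: the decoder's values
# differ in the last few digits, as can happen when floating-point
# reductions run in a different order.
encoder_logits = [2.3100000, -0.7400000, 0.5200000]
decoder_logits = [2.3100004, -0.7399997, 0.5200002]

assert encoder_logits != decoder_logits
assert quantize_logits(encoder_logits) == quantize_logits(decoder_logits)
print(quantize_logits(encoder_logits))
```

Rounding alone does not remove the problem for a logit that sits very close to a grid boundary, where a tiny perturbation can still flip the result; a complete protocol would need to handle that case, and this sketch does not attempt to.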
Taxonomy
Topics: Advanced Data Storage Technologies · Big Data and Digital Economy · Data Quality and Management
