Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding
Roberto Tacconelli

TL;DR
Nacrith is a lossless compression system that combines a 135M-parameter transformer language model with an ensemble of lightweight online predictors and a 32-bit arithmetic coder. It reports the best compression among the systems evaluated in the study on natural language text, and extends neural compression to arbitrary binary files.
Contribution
Nacrith adds five components on top of the base LLM-plus-arithmetic-coding approach: high-precision CDF coding (2^24 instead of 2^16), a token-level N-gram predictor, an adaptive log-space bias head, confidence-based LLM skipping, and a hybrid binary format (NC06).
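To make the adaptive bias correction more concrete, here is a minimal sketch of one way a per-document log-space bias could be learned with online gradient descent. It is an illustration under stated assumptions, not the paper's implementation: the function names, the learning rate, and the choice of a plain cross-entropy gradient step are all hypothetical.

```python
import math

def corrected_probs(log_probs, bias):
    """Add a per-token bias to the model's log-probabilities and renormalize."""
    z = [lp + b for lp, b in zip(log_probs, bias)]
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def update_bias(log_probs, bias, target, lr=0.1):
    """One online gradient step on -log q[target] with respect to the bias.

    The gradient of the cross-entropy loss with respect to the bias is
    q - onehot(target), where q is the corrected distribution.
    The learning rate is an assumed value chosen for illustration.
    """
    q = corrected_probs(log_probs, bias)
    return [
        b - lr * (qi - (1.0 if i == target else 0.0))
        for i, (b, qi) in enumerate(zip(bias, q))
    ]

# Toy example: the model under-predicts token 2, which this document emits often.
model_log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]
bias = [0.0, 0.0, 0.0]
before = corrected_probs(model_log_probs, bias)[2]
for _ in range(50):
    bias = update_bias(model_log_probs, bias, target=2)
after = corrected_probs(model_log_probs, bias)[2]
print(before, after)
```

Because the update happens only after a token has been coded, a decoder can apply the same step and stay in sync with the encoder, which is the property any online predictor in a lossless compressor needs.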
Findings
Achieves 0.918 bits per byte (bpb) on the Canterbury Corpus, outperforming traditional compressors.
Surpasses ts_zip and FineZip on enwik8 with a smaller model.
Maintains high performance on out-of-distribution data, confirming robustness.
Abstract
We present Nacrith, a lossless compression system that combines a 135M-parameter transformer language model (SmolLM2-135M) with an ensemble of lightweight online predictors and a 32-bit arithmetic coder, achieving the best compression results among the systems evaluated in this study on natural language text. Beyond the base LLM-plus-arithmetic-coding paradigm, Nacrith introduces several contributions: (1) a CDF precision upgrade from 2^16 to 2^24 that eliminates ~75% of quantization overhead caused by minimum-probability floors in large vocabularies; (2) a token-level N-gram model for fast local predictions; (3) an adaptive log-space bias head correcting per-document LLM errors via online gradient descent; (4) confidence-based LLM skip for accelerating highly predictable tokens; (5) a hybrid binary format (NC06) extending neural compression to arbitrary binary files--to our knowledge a…
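The effect of CDF precision on minimum-probability floors can be illustrated with a small sketch. An arithmetic coder needs every token to own at least one integer count in the CDF, so with a large vocabulary a coarse range spends much of its budget on floors. The code below is a toy model under stated assumptions, not Nacrith's coder: the vocabulary size of 49,152, the quantization rule, and the peaked test distribution are all chosen for illustration.

```python
import math

def quantize_counts(probs, precision_bits):
    """Map probabilities to integer counts >= 1 that sum to 2**precision_bits."""
    total = 1 << precision_bits
    n = len(probs)
    if n > total:
        raise ValueError("vocabulary does not fit in the CDF range")
    # Reserve one count per token (the floor), then share the rest proportionally.
    spare = total - n
    counts = [1 + int(p * spare) for p in probs]
    # Rounding down leaves a few counts unassigned; give them to the likeliest token.
    top = max(range(n), key=probs.__getitem__)
    counts[top] += total - sum(counts)
    return counts

def overhead_bits(probs, precision_bits):
    """Expected extra bits per token caused by quantization (a KL divergence)."""
    total = 1 << precision_bits
    counts = quantize_counts(probs, precision_bits)
    return sum(
        p * math.log2(p * total / c) for p, c in zip(probs, counts) if p > 0
    )

VOCAB = 49152  # assumed vocabulary size, for illustration only
# A highly predictable context: one token at 0.99, the rest share 0.01 evenly.
peaked = [0.99] + [0.01 / (VOCAB - 1)] * (VOCAB - 1)

for bits in (16, 24):
    print(bits, overhead_bits(peaked, bits))
```

In this toy setting the 16-bit range has 65,536 counts, of which 49,152 go to floors, so even a token the model predicts at 0.99 is coded at roughly 0.25 probability. At 24 bits the floors take a negligible share of the range and the quantized distribution stays close to the model's.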
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Natural Language Processing Techniques
