Llamazip: Leveraging LLaMA for Lossless Text Compression and Training Dataset Detection
Sören Dréano, Derek Molloy, Noel Murphy

TL;DR
Llamazip is a lossless text compression method leveraging LLaMA's predictive power, reducing data size and enabling detection of training data origin, thus enhancing storage efficiency and transparency in language models.
Contribution
Introduces Llamazip, a novel compression algorithm based on LLaMA, and demonstrates its ability to identify training data provenance, addressing data privacy and transparency issues.
Findings
Llamazip achieves significant data reduction with no data loss, since the compression is lossless.
The method shows potential for identifying whether a document was part of a language model's training dataset.
Compression ratios and computational requirements are both affected by quantization and context window size.
Abstract
This work introduces Llamazip, a novel lossless text compression algorithm based on the predictive capabilities of the LLaMA3 language model. Llamazip achieves significant data reduction by only storing tokens that the model fails to predict, optimizing storage efficiency without compromising data integrity. Key factors affecting its performance, including quantization and context window size, are analyzed, revealing their impact on compression ratios and computational requirements. Beyond compression, Llamazip demonstrates the potential to identify whether a document was part of the training dataset of a language model. This capability addresses critical concerns about data provenance, intellectual property, and transparency in language model training.
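The core idea in the abstract, storing only the tokens the model fails to predict, can be illustrated with a minimal sketch. This is not the authors' implementation: a toy bigram predictor (most frequent successor of the previous token) stands in for LLaMA3, and the names `build_predictor`, `compress`, and `decompress` are hypothetical. The only property the scheme relies on is that the predictor is deterministic, so the decompressor can replay the same predictions.

```python
from collections import Counter, defaultdict


def build_predictor(corpus_tokens):
    """Toy stand-in for the language model (assumption, not LLaMA3).

    Predicts the most frequent successor of the previous token,
    i.e. a context window of a single token.
    """
    followers = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        followers[prev][nxt] += 1
    table = {prev: counts.most_common(1)[0][0] for prev, counts in followers.items()}

    def predict(context):
        # No context, or an unseen previous token, yields no prediction.
        if not context:
            return None
        return table.get(context[-1])

    return predict


def compress(tokens, predict):
    """Keep only the tokens the predictor gets wrong, with their positions."""
    misses = [
        (i, tok)
        for i, tok in enumerate(tokens)
        if predict(tokens[:i]) != tok
    ]
    return len(tokens), misses


def decompress(length, misses, predict):
    """Rebuild the text by replaying predictions and patching in stored misses."""
    stored = dict(misses)
    out = []
    for i in range(length):
        out.append(stored[i] if i in stored else predict(out))
    return out


corpus = "the cat sat on the mat and the cat ran".split()
predict = build_predictor(corpus)

text = "the cat sat on the cat".split()
length, misses = compress(text, predict)
restored = decompress(length, misses, predict)
```

In this sketch, text that resembles what the predictor was built from produces few stored misses, while unfamiliar text produces many. That asymmetry is the intuition behind using the compression ratio as a signal for whether a document was in the training data, although the paper's actual detection procedure is not specified in this summary.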
Taxonomy
Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Scientific Computing and Data Management
