Large Language Model for Lossless Image Compression with Visual Prompts
Junhao Du, Chuqin Zhou, Ning Cao, Gang Chen, Yunuo Chen, Zhengxue Cheng, Li Song, Guo Lu, Wenjun Zhang

TL;DR
This paper presents a novel lossless image compression method that leverages large language models with visual prompts, achieving state-of-the-art results by integrating visual embeddings and residual prediction.
Contribution
Introduces a new paradigm using LLMs with visual prompts for lossless image compression, bridging textual prior knowledge and image data to improve entropy modeling.
Findings
Achieves state-of-the-art compression performance on benchmark datasets.
Effectively extends to medical and screen content images.
Outperforms traditional and existing learning-based codecs.
Abstract
Recent advancements in deep learning have driven significant progress in lossless image compression. With the emergence of Large Language Models (LLMs), preliminary attempts have been made to leverage the extensive prior knowledge embedded in these pretrained models to enhance lossless image compression, particularly by improving the entropy model. However, a significant challenge remains in bridging the gap between the textual prior knowledge within LLMs and lossless image compression. To tackle this challenge and unlock the potential of LLMs, this paper introduces a novel paradigm for lossless image compression that incorporates LLMs with visual prompts. Specifically, we first generate a lossy reconstruction of the input image as visual prompts, from which we extract features to serve as visual embeddings for the LLM. The residual between the original image and the lossy…
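The lossy-plus-residual split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_lossy_codec` is a hypothetical stand-in (coarse uniform quantization) for a real lossy codec, the function names and the step size of 16 are assumptions, and the LLM-based entropy model that would code the residual is omitted. The sketch only shows that storing the lossy reconstruction together with the exact residual recovers the original pixels.

```python
import random


def toy_lossy_codec(pixels, step=16):
    """Hypothetical stand-in for a lossy codec: coarse uniform quantization.

    Each 8-bit value is mapped to the midpoint of its quantization bin.
    """
    return [min(255, (p // step) * step + step // 2) for p in pixels]


def split_residual(pixels, lossy):
    """Residual between the original pixels and the lossy reconstruction."""
    return [p - q for p, q in zip(pixels, lossy)]


def reconstruct(lossy, residual):
    """Lossless reconstruction: lossy reconstruction plus exact residual."""
    return [q + r for q, r in zip(lossy, residual)]


random.seed(0)
pixels = [random.randrange(256) for _ in range(64)]  # toy 8-bit "image"

lossy = toy_lossy_codec(pixels)            # would serve as the visual prompt
residual = split_residual(pixels, lossy)   # would be entropy-coded losslessly
restored = reconstruct(lossy, residual)

print(restored == pixels)
```

With this toy quantizer the residual values stay within a narrow range around zero, which is the kind of signal an entropy model conditioned on the lossy reconstruction is meant to code.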
Peer Reviews
Decision·Submitted to ICLR 2025
Building on previous work that uses Large Language Models (LLMs) as entropy models for lossless compression, the authors argue that the limited performance gains over traditional (non-learning-based) methods stem from differences between the textual features captured by pre-trained LLMs and the intrinsic characteristics of image pixels. To address this, they propose inputting visual embeddings into the LLM to enhance performance. The concept is clearly presented, the argument is compelling, and
The authors acknowledge the time-consuming nature of their proposed approach. However, it is important to note that this issue arises not only from the autoregressive structure of the method but also from the additional computational load introduced by the visual embeddings compared to previous approaches. A comparison of this aspect would be insightful. The authors also address the role of lossy compression within the framework. They conducted an ablation study on the quantization parameter of
- Instead of directly encoding the image pixels as previous works in this domain do, they use off-the-shelf lossy compression models and encode the residuals. To achieve this, they obtain embeddings for the entire image (global) and for patches (local) and use them as "prompts" to the language model, along with a GMM (Gaussian mixture model) to accurately model the underlying distribution.
- The proposed system might appear to be a hodgepodge of different tricks, but each one contributes meaningfully t
- The most obvious weakness of such a system would be the required compute. Hence, I think mentioning the MACs for encoding and decoding would be good. I wonder if the authors tried any quantized LLMs for this task, i.e., models known to be small with a much smaller footprint, e.g., [Phi 3 mini](https://huggingface.co/QuantFactory/Phi-3-mini-128k-instruct-GGUF) or even [BitNet](https://github.com/microsoft/BitNet).
- For a system that relies on an LLM for most of the heavy lifting
Using LLMs for entropy coding in image compression is a relatively new topic. Experimental results show that the proposed method achieves better coding performance compared to the prior work that also adopts LLMs for entropy coding.
1. According to Table 2, it appears that after applying the global prompt, the additional improvement from using the local prompt is minimal, providing only a 0.8% gain. Furthermore, although the paper emphasizes the importance of optimized embedding, Table 2 shows it provides only a modest 0.4% additional gain.
2. The comparison of complexity with baseline methods (e.g., model size, encoding/decoding MACs, or runtime) is not provided.
3. The practicality of using an LLM as an entropy model is
Taxonomy
Topics: Advanced Data Compression Techniques · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
