Bridging Compressed Image Latents and Multimodal Large Language Models
Chia-Hao Kao, Cheng Chien, Yu-Jen Tseng, Yi-Hsin Chen, Alessandro Gnutti, Shao-Yuan Lo, Wen-Hsiao Peng, Riccardo Leonardi

TL;DR
This paper introduces a lightweight framework to adapt compressed image latents for multimodal large language models, enabling efficient vision tasks on resource-constrained devices without retraining entire MLLMs.
Contribution
We propose a novel, general framework that adapts neural image compression outputs for MLLMs, excluding the large MLLM components from training, which is more practical and scalable.
Findings
Achieves high rate-accuracy performance with less complexity.
Compatible with various neural image codecs and MLLMs.
Effective across multiple application scenarios.
Abstract
This paper presents the first-ever study of adapting compressed image latents to suit the needs of downstream vision tasks that adopt Multimodal Large Language Models (MLLMs). MLLMs have extended the success of large language models to modalities (e.g. images) beyond text, but their billion scale hinders deployment on resource-constrained end devices. While cloud-hosted MLLMs could be available, transmitting raw, uncompressed images captured by end devices to the cloud requires an efficient image compression system. To address this, we focus on emerging neural image compression and propose a novel framework with a lightweight transform-neck and a surrogate loss to adapt compressed image latents for MLLM-based vision tasks. Given the huge scale of MLLMs, our framework excludes the entire downstream MLLM except part of its visual encoder from training our system. This stands out from most…
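The training setup described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear `transform_neck`, the single-layer `encoder_tail` standing in for the frozen part of the visual encoder, the mean-squared feature-matching loss, and all shapes are assumptions chosen for illustration, since the abstract does not specify any of them. The point it shows is that only the neck is updated, while gradients pass through a frozen encoder part and no MLLM appears in the loop.

```python
import numpy as np


def encoder_tail(features, w_tail):
    """Hypothetical stand-in for the frozen part of the visual encoder."""
    return np.tanh(features @ w_tail)


def transform_neck(latents, w_neck):
    """Hypothetical lightweight linear map from codec latents to encoder features."""
    return latents @ w_neck


def surrogate_loss(latents, targets, w_neck, w_tail):
    """Assumed surrogate: MSE between predicted and reference encoder features."""
    pred = encoder_tail(transform_neck(latents, w_neck), w_tail)
    return float(np.mean((pred - targets) ** 2))


def train_neck(latents, targets, w_tail, steps=300, lr=0.2):
    """Plain gradient descent on the neck only; w_tail is never updated."""
    w_neck = np.zeros((latents.shape[1], w_tail.shape[0]))
    for _ in range(steps):
        pred = encoder_tail(transform_neck(latents, w_neck), w_tail)
        # Backpropagate the MSE through tanh and the frozen tail to the neck.
        d_pre = 2.0 * (pred - targets) / pred.size * (1.0 - pred ** 2)
        grad_neck = latents.T @ (d_pre @ w_tail.T)
        w_neck -= lr * grad_neck
    return w_neck


def make_toy_problem(seed=0):
    """Random latents plus reference features from a hidden reference neck."""
    rng = np.random.default_rng(seed)
    latents = rng.normal(size=(64, 8))             # toy "compressed latents"
    w_tail = rng.normal(scale=0.3, size=(16, 16))  # frozen encoder weights
    w_ref = rng.normal(scale=0.3, size=(8, 16))    # used only to build targets
    targets = encoder_tail(latents @ w_ref, w_tail)
    return latents, targets, w_tail


latents, targets, w_tail = make_toy_problem()
w_neck = train_neck(latents, targets, w_tail)
print("surrogate loss after training:", surrogate_loss(latents, targets, w_neck, w_tail))
```

In this sketch the cost of each step depends only on the neck and the small frozen tail, which is the property the abstract attributes to excluding the downstream MLLM from training.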
Peer Reviews
Decision: ICLR 2025 Poster
* This paper is easy to follow
* The topic of reducing the cost of visual input in MLLMs is of great practical value
* The idea of adapting compressed image latent to MLLMs makes sense to me
This paper can be further strengthened by:
* One question that remains unclear about the motivation is why we need the proposed method of compressing image latents instead of existing token pruning/merging work on MLLMs such as CrossGET [1]. These methods can also reduce the costs of MLLMs, and I’d suggest the authors discuss the difference between existing acceleration methods for MLLMs and the proposed method, and highlight their unique contributions to efficient MLLMs.
* The proposed method
1. This paper focuses on an interesting scenario: how to achieve image compression in the context of MLLMs.
2. The writing in this paper is well-organized, making it easy to follow and understand the authors' intentions.
1. The proposed method appears to be a general image compression approach, lacking unique design elements specifically tailored for MLLMs. While the authors emphasize that this approach can achieve image compression in a resource-efficient manner to support MLLMs, they do not provide specific comparisons regarding the reduction in training costs—such as training time or computational resources—compared to methods that incorporate MLLMs directly into the training process.
2. The proposed SURROG
1. The paper introduces a lightweight transform-neck and a surrogate loss function, which together reduce computational complexity and avoid the need to back-propagate through the massive MLLMs, making the training process more efficient.
2. The proposed framework is generic and can accommodate various neural image codecs and MLLMs that share the same visual encoder, enhancing its versatility and applicability across different models and tasks.
3. The paper demonstrates the effectiveness of the
1. The paper's approach relies on the assumption that the downstream MLLMs will use the same pre-trained CLIP visual encoder, which may limit the applicability of the proposed method to MLLMs that employ custom or different visual encoders.
2. The paper does not provide a comprehensive comparison with existing image compression methods beyond the context of MLLMs, which could be important for understanding the method's performance in a broader range of applications
Taxonomy
Topics
Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization
