Bridging Compressed Image Latents and Multimodal Large Language Models

Chia-Hao Kao; Cheng Chien; Yu-Jen Tseng; Yi-Hsin Chen; Alessandro Gnutti; Shao-Yuan Lo; Wen-Hsiao Peng; Riccardo Leonardi

arXiv:2407.19651·cs.CV·February 18, 2025

TL;DR

This paper introduces a lightweight framework to adapt compressed image latents for multimodal large language models, enabling efficient vision tasks on resource-constrained devices without retraining entire MLLMs.

Contribution

We propose a novel, general framework that adapts neural image compression outputs for MLLMs, excluding the large MLLM components from training, which is more practical and scalable.

Findings

1. Achieves high rate-accuracy performance with less complexity.
2. Compatible with various neural image codecs and MLLMs.
3. Effective across multiple application scenarios.

Abstract

This paper presents the first-ever study of adapting compressed image latents to suit the needs of downstream vision tasks that adopt Multimodal Large Language Models (MLLMs). MLLMs have extended the success of large language models to modalities (e.g. images) beyond text, but their billion scale hinders deployment on resource-constrained end devices. While cloud-hosted MLLMs could be available, transmitting raw, uncompressed images captured by end devices to the cloud requires an efficient image compression system. To address this, we focus on emerging neural image compression and propose a novel framework with a lightweight transform-neck and a surrogate loss to adapt compressed image latents for MLLM-based vision tasks. Given the huge scale of MLLMs, our framework excludes the entire downstream MLLM except part of its visual encoder from training our system. This stands out from most…
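The training setup described above (a small trainable transform-neck, a surrogate loss, and no updates to the downstream model) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the linear maps, the dimensions, the mean-squared feature-matching objective, and names such as `W_FROZEN`, `surrogate_loss`, and `train_neck` are hypothetical and are not taken from the paper, whose actual architecture and loss are not specified in this summary.

```python
import random

random.seed(0)
D_LATENT, D_HIDDEN, D_FEAT, N_SAMPLES = 8, 6, 4, 32


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def zero_neck():
    return [[0.0] * D_LATENT for _ in range(D_HIDDEN)]


# Hypothetical stand-in for the frozen part of the visual encoder.
# It is used in the forward and backward pass but never updated.
W_FROZEN = rand_matrix(D_FEAT, D_HIDDEN)

# Toy codec latents, plus the target features a frozen encoder would produce
# from the uncompressed images (synthesized so that a perfect neck exists).
_true_neck = rand_matrix(D_HIDDEN, D_LATENT)
latents = [[random.gauss(0.0, 1.0) for _ in range(D_LATENT)]
           for _ in range(N_SAMPLES)]
targets = [matvec(W_FROZEN, matvec(_true_neck, y)) for y in latents]


def surrogate_loss(neck):
    """Assumed surrogate: mean squared error between feature vectors."""
    total = 0.0
    for y, t in zip(latents, targets):
        feat = matvec(W_FROZEN, matvec(neck, y))
        total += sum((f - x) ** 2 for f, x in zip(feat, t))
    return total / (N_SAMPLES * D_FEAT)


def train_neck(steps=200, lr=0.01):
    """Full-batch gradient descent on the neck only; no MLLM is involved."""
    neck = zero_neck()
    scale = 2.0 / (N_SAMPLES * D_FEAT)
    for _ in range(steps):
        grad = zero_neck()
        for y, t in zip(latents, targets):
            feat = matvec(W_FROZEN, matvec(neck, y))
            resid = [f - x for f, x in zip(feat, t)]
            # Gradient flows back through the frozen map without changing it.
            g_hidden = [sum(W_FROZEN[k][i] * resid[k] for k in range(D_FEAT))
                        for i in range(D_HIDDEN)]
            for i in range(D_HIDDEN):
                for j in range(D_LATENT):
                    grad[i][j] += scale * g_hidden[i] * y[j]
        for i in range(D_HIDDEN):
            for j in range(D_LATENT):
                neck[i][j] -= lr * grad[i][j]
    return neck


initial_loss = surrogate_loss(zero_neck())
trained_neck = train_neck()
final_loss = surrogate_loss(trained_neck)
print(f"surrogate loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

The point of the sketch is only the division of labor: the sole trainable parameters are those of the neck, and the objective is computed from encoder features, so nothing has to be back-propagated through a language model.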

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* This paper is easy to follow
* The topic of reducing the cost of visual input in MLLMs is of great practical value
* The idea of adapting compressed image latent to MLLMs makes sense to me

Weaknesses

This paper can be further strengthened by:
* One question that remains unclear about the motivation is why we need the proposed method of compressing image latent instead of existing token pruning/merging work on MLLMs such as CrossGET [1]. These methods can also reduce the costs of MLLMs, and I’d suggest the authors discuss the difference between existing acceleration methods for MLLMs and the proposed method, and highlight their unique contributions to efficient MLLMs.
* The proposed method

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This paper focuses on an interesting scenario: how to achieve image compression in the context of MLLMs.
2. The writing in this paper is well-organized, making it easy to follow and understand the authors' intentions.

Weaknesses

1. The proposed method appears to be a general image compression approach, lacking unique design elements specifically tailored for MLLMs. While the authors emphasize that this approach can achieve image compression in a resource-efficient manner to support MLLMs, they do not provide specific comparisons regarding the reduction in training costs (such as training time or computational resources) compared to methods that incorporate MLLMs directly into the training process.
2. The proposed SURROG

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper introduces a lightweight transform-neck and a surrogate loss function, which together reduce computational complexity and avoid the need to back-propagate through the massive MLLMs, making the training process more efficient.
2. The proposed framework is generic and can accommodate various neural image codecs and MLLMs that share the same visual encoder, enhancing its versatility and applicability across different models and tasks.
3. The paper demonstrates the effectiveness of the

Weaknesses

1. The paper's approach relies on the assumption that the downstream MLLMs will use the same pre-trained CLIP visual encoder, which may limit the applicability of the proposed method to MLLMs that employ custom or different visual encoders.
2. The paper does not provide a comprehensive comparison with existing image compression methods beyond the context of MLLMs, which could be important for understanding the method's performance in a broader range of applications

Videos

Bridging Compressed Image Latents and Multimodal Large Language Models · SlidesLive

Taxonomy

Topics: Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications · Video Analysis and Summarization

MethodsFocus