Benchmarking and Enhancing VLM for Compressed Image Understanding
Zifu Zhang, Tongda Xu, Siqi Li, Shengxi Li, Yue Zhang, Mai Xu, Yan Wang

TL;DR
This paper introduces a comprehensive benchmark to evaluate and improve Vision-Language Models' ability to understand compressed images, revealing performance gaps and proposing a universal adaptor to enhance model robustness across various codecs and bitrates.
Contribution
The work is the first to benchmark VLM performance on compressed images and proposes a universal adaptor to improve understanding across different compression settings.
Findings
VLM performance drops significantly on compressed images.
A universal adaptor improves VLM accuracy by 10-30%.
Performance gap mainly due to generalization failure, not information loss.
Abstract
With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs on compressed images, covering widely used image codecs and a diverse set of tasks, and encompassing over one million compressed images. Next, we analyse the source of the performance gap by decomposing it into a) the information loss during compression and b) the generalisation failure of the VLM. We visualise these gaps with concrete examples and identify that for compressed images, only the generalisation gap can be…
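The abstract speaks of "low-bitrate" images without stating a unit. Assuming the bitrate is measured in bits per pixel (bpp), the usual convention for image codecs but not confirmed by the text shown here, the arithmetic can be sketched as below. The function names `bits_per_pixel` and `byte_budget` are illustrative and do not come from the paper.

```python
def bits_per_pixel(num_bytes: int, width: int, height: int) -> float:
    """Bitrate in bits per pixel (bpp) of a compressed file.

    Assumption: bpp = 8 * file size in bytes / number of pixels.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return 8 * num_bytes / (width * height)


def byte_budget(target_bpp: float, width: int, height: int) -> int:
    """Largest file size in bytes that stays at or below target_bpp."""
    if target_bpp < 0:
        raise ValueError("target_bpp must be non-negative")
    return int(target_bpp * width * height // 8)


if __name__ == "__main__":
    # A hypothetical 512x512 image stored in 8192 bytes.
    print(bits_per_pixel(8192, 512, 512))
    # Byte budget for the same image size at 0.30 bpp.
    print(byte_budget(0.30, 512, 512))
```

Under this reading, a 512x512 image at 0.30 bpp must fit in under 10 KB, which gives a sense of how aggressive the compression is.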
Peer Reviews
Decision: Submitted to ICLR 2026
* The paper focuses on compressed image understanding, which is highly relevant for real-world applications. It addresses a significant deployment gap in current VLM research.
* The inclusion of diverse codecs across multiple bitrates provides a thorough evaluation framework. The coverage of distinct tasks ensures a holistic assessment of VLM capabilities.
* The proposed VLM adaptor is lightweight and effective, demonstrating 10%-30% improvements across different codecs and bitrates without requ
* As the adaptor is a key contribution of this work, more experimental validation should be provided. The experimental results only present performance improvements on images compressed by the codecs selected during training, and lack analysis of the adaptor's generalization capability on images compressed by other, unselected codecs.
* It would be better to provide a performance comparison between the proposed adaptor and other existing methods for enhancing VLMs for compressed images understan
I think the paper has the following strengths:
1. It sets up the first comprehensive benchmark for compressed images with VLMs. The author has experimented on major codecs and bitrates, which convinces me of the robustness of the benchmark.
2. The proposed VLM adaptor is lightweight, codec-agnostic, and improves performance by 10–30% without requiring full model retraining.
3. The author has clear motivation and problem definitions with an information gap (irreversible loss) and a generalizatio
In general, I am satisfied with the paper, but small concerns remain:
1. Although the author claims that the proposed adaptor performs well across different codecs and compression methods, I think there are some drawbacks. For VLMs which share the same vision encoders (like the Qwen-VL series, or other public VLMs which share CLIP vision encoders), I think the proposed method should also work, i.e., the proposed adaptor could be shared across different VLMs. But the author only uses Qwen
1. The paper provides a large-scale benchmark with over one million compressed images to evaluate VLMs. It covers a wide range of image codecs and tasks.
2. It reasonably and effectively identifies and distinguishes the information loss and generalization gaps for VLM loss.
3. The authors propose a lightweight VLM adaptor for the generalization gap, which improves performance by 10%-30% across different codecs and bitrates.
4. The method generally offers potential for real-world applications whe
1. This benchmark is not tested with the latest VLM models, such as GPT-4o and Gemini 2.5.
2. This paper focuses on addressing the generalization gap but does not provide extensive solutions for dealing with information loss due to compression.
3. Since the benchmark is heavily based on specific tasks, the findings and experimental results might not generalize well to tasks that are not covered in the benchmark.
4. The bitrate range is about 0.0-0.30, which is quite a low bitrate range. Experime
Taxonomy
Topics: Advanced Data Compression Techniques · Video Coding and Compression Technologies · Digital Media Forensic Detection
