TL;DR
This paper introduces a prompt-guided prefiltering method for VLM image compression that reduces bitrate by 25-50% while maintaining task accuracy, enhancing efficiency for cloud-based image understanding applications.
Contribution
It presents a novel, lightweight, plug-and-play prefiltering module that adapts to prompt-driven VLMs, improving compression without task-specific assumptions.
Findings
Achieves 25-50% bitrate reduction on VQA benchmarks.
Preserves task-relevant details while smoothing irrelevant regions.
Codec-agnostic and compatible with various encoders.
Abstract
The rapid progress of large Vision-Language Models (VLMs) has enabled a wide range of applications, such as image understanding and Visual Question Answering (VQA). Query images are often uploaded to the cloud, where VLMs are typically hosted, so efficient image compression becomes crucial. However, traditional human-centric codecs are suboptimal in this setting because they preserve many task-irrelevant details. Existing Image Coding for Machines (ICM) methods also fall short, as they assume a fixed set of downstream tasks and cannot adapt to prompt-driven VLMs with an open-ended variety of objectives. We propose a lightweight, plug-and-play, prompt-guided prefiltering module that identifies the image regions most relevant to the text prompt, and consequently to the downstream task. The module preserves important details while smoothing out less relevant areas to improve compression…
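The "preserve relevant regions, smooth the rest" idea can be illustrated with a minimal sketch. It is not the paper's module: it assumes a per-pixel relevance map in [0, 1] is already available (how the prompt produces that map is not described in the text above), and it substitutes a plain box blur as the smoothing step. The names `box_blur`, `prefilter`, `relevance`, and `radius` are hypothetical.

```python
import numpy as np


def box_blur(image: np.ndarray, radius: int) -> np.ndarray:
    """Blur an HxWxC image with a (2*radius+1)^2 box filter, using edge padding."""
    k = 2 * radius + 1
    h, w = image.shape[:2]
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros_like(image, dtype=np.float64)
    # Sum every shifted copy of the image that falls inside the kernel window.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w, :]
    return out / (k * k)


def prefilter(image: np.ndarray, relevance: np.ndarray, radius: int = 4) -> np.ndarray:
    """Blend the original and blurred image per pixel.

    relevance is an HxW map (assumed given): 1 keeps the pixel untouched,
    0 replaces it with its blurred value.
    """
    image = image.astype(np.float64)
    weight = np.clip(relevance, 0.0, 1.0)[..., None]  # broadcast over channels
    return weight * image + (1.0 - weight) * box_blur(image, radius)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32, 3))   # stand-in for a query image
    rel = np.zeros((32, 32))
    rel[:, :16] = 1.0               # pretend the left half matters for the prompt
    out = prefilter(img, rel)
    print("relevant half unchanged:", np.allclose(out[:, :16], img[:, :16]))
    print("variance of irrelevant half:", img[:, 16:].var(), "->", out[:, 16:].var())
```

Because the output is still an ordinary image, a filter of this kind can sit in front of any encoder, which is consistent with the codec-agnostic claim. Smoothed regions carry less high-frequency detail, so a standard codec would be expected to spend fewer bits on them.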
