CARES: Context-Aware Resolution Selector for VLMs
Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz

TL;DR
CARES is a lightweight module that intelligently selects the minimal image resolution needed for large vision-language models, significantly reducing computation without sacrificing accuracy across various tasks.
Contribution
Introduces CARES, a novel context-aware resolution selector that predicts the minimal sufficient resolution for VLMs, enabling substantial compute savings.
Findings
Reduces compute by up to 80% across benchmarks.
Maintains task performance with lower resolution inputs.
Works effectively across diverse multimodal tasks and models.
Abstract
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens ofter to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce \emph{CARES}-a \textbf{C}ontext-\textbf{A}ware \textbf{R}esolution \textbf{S}elector, a lightweight preprocessing module that, given an image-query pair, predicts the \emph{minimal} sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
