Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models

Mikel Williams-Lekuona; Georgina Cosma

arXiv:2512.15372 · cs.IR · January 16, 2026

TL;DR

ICAR introduces an adaptive computation method for vision transformers that reduces processing for simple images while maintaining high accuracy, enabling faster and more sustainable vision-language systems.

Contribution

The paper presents ICAR, a novel adaptive retrieval approach with dual-path training for compatible embeddings, and ConvNeXt-IC for efficient image complexity assessment, improving speed and scalability.

Findings

01

ICAR achieves 20% faster image encoding without performance loss.

02

ConvNeXt-IC attains a Pearson correlation of 0.959 with human labels.

03

ConvNeXt-IC enables 4.4x faster complexity prediction.
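
For readers unfamiliar with the metric in finding 02, the following is a minimal sketch of how a Pearson correlation between predicted complexity scores and human labels can be computed. The function name `pearson` and the two score lists are illustrative assumptions with made-up toy values; they are not data or code from the paper.

```python
import numpy as np

def pearson(predicted, human):
    """Pearson correlation: covariance divided by the product of the spreads."""
    p = np.asarray(predicted, dtype=float)
    h = np.asarray(human, dtype=float)
    pc, hc = p - p.mean(), h - h.mean()  # centre both series
    return float((pc @ hc) / np.sqrt((pc @ pc) * (hc @ hc)))

# Made-up toy scores in [0, 1]; NOT values from the paper.
predicted = [0.12, 0.35, 0.50, 0.81, 0.93]
human = [0.10, 0.40, 0.45, 0.85, 0.90]

r = pearson(predicted, human)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 means the predicted scores rise and fall almost linearly with the human labels, which is the sense in which a correlation of 0.959 indicates close agreement.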

Abstract

Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both the early-exit and full-depth paths. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct…
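
The routing idea in the abstract can be illustrated with a small sketch: a complexity score decides whether an image stops at an intermediate layer or runs the full depth, and either embedding is compared directly against a text embedding. Everything named here is an assumption made for illustration (the toy residual layers, `EXIT_LAYER`, the `threshold` of 0.5, and the complexity scores); the abstract does not specify these details. The sketch also omits the dual-path training that, per the abstract, is what makes the two kinds of embedding compatible.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, DEPTH, EXIT_LAYER = 16, 6, 3  # hypothetical sizes, not from the paper

# Toy stand-ins for transformer blocks: one residual map per layer.
LAYERS = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(DEPTH)]

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def encode_image(features, complexity, threshold=0.5):
    """Return (embedding, layers_used); low-complexity images exit early."""
    n_layers = EXIT_LAYER if complexity < threshold else DEPTH
    h = features
    for w in LAYERS[:n_layers]:
        h = h + np.tanh(h @ w)  # residual update, a placeholder for a real block
    return l2_normalize(h), n_layers

def retrieve(text_embedding, image_embeddings):
    """Rank images by cosine similarity to a single text embedding."""
    scores = image_embeddings @ text_embedding
    return np.argsort(-scores)

# Four toy images with assumed complexity scores in [0, 1].
features = rng.normal(size=(4, DIM))
complexities = np.array([0.1, 0.9, 0.3, 0.7])

encoded = [encode_image(f, c) for f, c in zip(features, complexities)]
image_embeddings = np.stack([emb for emb, _ in encoded])
layers_used = [n for _, n in encoded]

text_embedding = l2_normalize(rng.normal(size=DIM))
ranking = retrieve(text_embedding, image_embeddings)
print("layers used per image:", layers_used)
print("retrieval order:", ranking.tolist())
```

Because both paths write into one embedding matrix, retrieval is a single similarity search with no reranking stage. In an untrained toy like this the early-exit and full-depth embeddings are not actually aligned; the sketch only shows where the routing decision sits.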

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Text Readability and Simplification