Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
Mikel Williams-Lekuona, Georgina Cosma

TL;DR
ICAR introduces an adaptive computation method for vision transformers that reduces processing for simple images while maintaining high accuracy, enabling faster and more sustainable vision-language systems.
Contribution
The paper presents ICAR, an adaptive retrieval approach whose dual-path training keeps early-exit and full-depth image embeddings compatible with text embeddings, and ConvNeXt-IC, a model for efficient image complexity assessment. Together they improve speed and scalability.
Findings
ICAR achieves 20% faster image encoding without performance loss.
ConvNeXt-IC attains a Pearson correlation of 0.959 with human labels.
ConvNeXt-IC, the complexity predictor used by ICAR, enables 4.4x faster complexity prediction.
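For readers unfamiliar with the metric in the second finding, the Pearson correlation measures how linearly a model's predicted complexity scores track the human labels, with 1.0 meaning a perfect linear relationship. The snippet below is a minimal, generic sketch of that computation; the function name `pearson` and the sample scores are illustrative assumptions and are not taken from the paper or its evaluation data.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalised; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted vs. human complexity scores, for illustration only.
predicted = [0.10, 0.35, 0.50, 0.80]
human = [0.15, 0.30, 0.55, 0.75]
r = pearson(predicted, human)
```

A value of `r` close to 1.0 indicates that the predictor ranks and scales images much as human annotators do.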
Abstract
Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both the early-exit and full-depth paths. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct…
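The routing idea described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the function `encode_adaptive`, the `threshold` and `exit_layer` parameters, and the stand-in layers are all hypothetical, and the complexity score is assumed to lie in [0, 1] with higher values meaning more complex images.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def encode_adaptive(
    features: Sequence[float],
    layers: Sequence[Layer],
    complexity: float,
    threshold: float = 0.5,   # hypothetical cut-off between simple and complex
    exit_layer: int = 2,      # hypothetical early-exit depth
) -> Tuple[Vector, int]:
    """Run fewer layers for simple images and the full stack for complex ones.

    Returns the resulting embedding and the number of layers actually used.
    """
    depth = exit_layer if complexity < threshold else len(layers)
    x = list(features)
    for layer in layers[:depth]:
        x = layer(x)
    return x, depth


def toy_layer(x: Vector) -> Vector:
    # Stand-in for a transformer block: adds 1.0 to every feature.
    return [v + 1.0 for v in x]


layers = [toy_layer] * 4

simple_embedding, simple_depth = encode_adaptive([0.0, 1.0], layers, complexity=0.2)
complex_embedding, complex_depth = encode_adaptive([0.0, 1.0], layers, complexity=0.9)
```

In this sketch the simple image stops after two of the four layers, while the complex image uses all four. What the toy version leaves out is the part the abstract identifies as the key challenge: training so that embeddings from both depths remain comparable with the same text embeddings.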
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Text Readability and Simplification
