Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference

Putu Indah Githa Cahyani; Komang David Dananjaya Suartana; Novanto Yudistira

arXiv:2512.20839 · cs.CV · December 25, 2025

TL;DR

This paper introduces an adaptive visual preprocessing technique that dynamically adjusts image resolution and cropping based on content, significantly reducing inference time and computational load for vision-language models without retraining.

Contribution

It presents a content-aware preprocessing method, integrated with FastVLM, that cuts per-image inference time by over 50% without altering the model architecture.

Findings

01. Reduces per-image inference time by over 50%

02. Achieves more than 55% reduction in visual token count

03. Maintains model performance while improving efficiency

Abstract

Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method is integrated with FastVLM without modifying its architecture…
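
The three stages named in the abstract (content-aware analysis, adaptive resolution selection, content-aware cropping) can be illustrated with a minimal sketch. The abstract does not say which content measure, thresholds, or target resolutions the authors use, so everything here is an assumption: mean gradient magnitude stands in for "content characteristics", and the function names, threshold values, and candidate sizes are hypothetical. No FastVLM API is shown; the sketch only produces a crop box and a target side length that a caller could apply before the model's own preprocessing.

```python
import numpy as np


def gradient_magnitude(gray: np.ndarray) -> np.ndarray:
    """Per-pixel gradient magnitude of a uint8 grayscale image scaled to [0, 1]."""
    g = gray.astype(np.float64) / 255.0
    gy, gx = np.gradient(g)  # central differences along rows, then columns
    return np.hypot(gx, gy)


def edge_density(gray: np.ndarray) -> float:
    """Content score (assumed measure): mean gradient magnitude."""
    return float(gradient_magnitude(gray).mean())


def select_resolution(score, thresholds=(0.02, 0.08), sizes=(256, 512, 1024)):
    """Pick a target side length; thresholds and sizes are illustrative only."""
    for limit, side in zip(thresholds, sizes):
        if score < limit:
            return side
    return sizes[-1]


def content_crop(gray: np.ndarray, eps: float = 0.02):
    """Bounding box (top, bottom, left, right) of pixels with gradient > eps."""
    mask = gradient_magnitude(gray) > eps
    height, width = gray.shape
    if not mask.any():
        return 0, height, 0, width  # flat image: keep the full frame
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1


def plan_preprocessing(gray: np.ndarray):
    """Return (crop_box, target_side) for one grayscale image."""
    top, bottom, left, right = content_crop(gray)
    if bottom - top < 2 or right - left < 2:
        # Crop too small to analyze; fall back to the full frame.
        top, bottom, left, right = 0, gray.shape[0], 0, gray.shape[1]
    cropped = gray[top:bottom, left:right]
    return (top, bottom, left, right), select_resolution(edge_density(cropped))


if __name__ == "__main__":
    image = np.zeros((64, 64), dtype=np.uint8)
    image[20:40, 24:44] = 255  # a single bright square on a dark background
    print(plan_preprocessing(image))
```

Under these assumptions, a visually simple image is cropped to its informative region and mapped to a smaller target side, which is the mechanism by which fewer visual tokens would reach the encoder. The actual criteria in the paper may differ.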

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning