UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking
Joan Perez, Giovanni Fusco

TL;DR
UVLM is a versatile, open-source framework that unifies the loading, configuration, and benchmarking of diverse vision-language models on custom tasks, promoting reproducibility and comparability in multimodal research.
Contribution
The paper introduces UVLM, a unified, Google Colab-compatible framework that supports multiple VLM architectures and provides standardized evaluation tools, enabling a first comprehensive benchmark on complex reasoning tasks.
Findings
Supported models include LLaVA-NeXT and Qwen2.5-VL.
Benchmarking was conducted on 120 street-view images, using tasks of increasing reasoning complexity.
UVLM enables consistent comparison of VLMs using identical prompts and evaluation protocols.
Abstract
Vision-Language Models (VLMs) have emerged as powerful tools for image understanding tasks, yet their practical deployment remains hindered by significant architectural heterogeneity across model families. This paper introduces UVLM (Universal Vision-Language Model Loader), a Google Colab-based framework that provides a unified interface for loading, configuring, and benchmarking multiple VLM architectures on custom image analysis tasks. UVLM currently supports two major model families -- LLaVA-NeXT and Qwen2.5-VL -- which differ fundamentally in their vision encoding, tokenization, and decoding strategies. The framework abstracts these differences behind a single inference function, enabling researchers to compare models using identical prompts and evaluation protocols. Key features include a multi-task prompt builder with support for four response types (numeric, category, boolean,…
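The abstract's idea of hiding model-family differences behind a single inference function, together with a typed multi-task prompt builder, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not UVLM's actual API: the names `Task`, `build_prompt`, `register`, `BACKENDS`, and `run_inference` are hypothetical, the two backends are stubs that load no model, and only the three response types named in the abstract are handled because the fourth is cut off in the text above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    """One question to ask about an image (hypothetical structure)."""
    name: str
    question: str
    response_type: str  # "numeric", "category", or "boolean"
    categories: Tuple[str, ...] = ()  # only used when response_type == "category"


def build_prompt(tasks: List[Task]) -> str:
    """Combine several tasks into one prompt with a per-task answer hint."""
    lines = ["Answer each task on its own line as `name: answer`."]
    for task in tasks:
        if task.response_type == "numeric":
            hint = "answer with a number"
        elif task.response_type == "category":
            hint = "answer with one of: " + ", ".join(task.categories)
        elif task.response_type == "boolean":
            hint = "answer yes or no"
        else:
            raise ValueError(f"unsupported response type: {task.response_type!r}")
        lines.append(f"{task.name}: {task.question} ({hint})")
    return "\n".join(lines)


# Registry mapping a model-family name to a callable (image_path, prompt) -> text.
BACKENDS: Dict[str, Callable[[str, str], str]] = {}


def register(family: str):
    """Decorator that adds a backend to the registry under `family`."""
    def wrap(fn: Callable[[str, str], str]) -> Callable[[str, str], str]:
        BACKENDS[family] = fn
        return fn
    return wrap


@register("llava-next")
def _llava_next_stub(image_path: str, prompt: str) -> str:
    # Stub: a real backend would apply this family's own preprocessing
    # and decoding here. No model is loaded in this sketch.
    return f"[llava-next stub] {image_path}"


@register("qwen2.5-vl")
def _qwen25_vl_stub(image_path: str, prompt: str) -> str:
    # Stub: same call signature, different family-specific internals.
    return f"[qwen2.5-vl stub] {image_path}"


def run_inference(family: str, image_path: str, prompt: str) -> str:
    """Single entry point: the caller never touches family-specific code."""
    try:
        backend = BACKENDS[family]
    except KeyError:
        raise ValueError(f"unsupported model family: {family!r}") from None
    return backend(image_path, prompt)
```

With this shape, the same prompt can be sent to every registered family in a loop such as `for family in BACKENDS: run_inference(family, "street.jpg", prompt)`, which is what makes results comparable across models. A registry is one possible design; the framework itself may dispatch differently.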
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
