Mordal: Automated Pretrained Model Selection for Vision Language Models

Shiqi He; Insu Jang; Mosharaf Chowdhury

arXiv:2502.00241 · cs.LG · February 4, 2025

Open Access · 3 Reviews

TL;DR

Mordal is an automated framework that efficiently searches for the optimal vision-language model for specific tasks, significantly reducing computational costs and discovering new high-performing models.

Contribution

Introducing Mordal, the first automated multimodal model search framework that optimizes VLM selection for specific tasks with reduced computational effort.

Findings

1. Mordal reduces the GPU hours spent on search by up to 11.6×.

2. It finds the best VLM for a variety of tasks.

3. It discovers new VLMs that outperform existing models.

Abstract

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Mordal enables rapid screening of vision encoder–language model combinations within open model repositories, cutting search time by roughly 9–12× relative to exhaustive search while maintaining top hit rates on most tasks. Its two-level clustering greatly reduces the computational cost of CKA, and the SHA and extrapolation strategies allow intermediate checkpoints to be reused, which keeps the overall pipeline efficient. Experimental results show that Mordal’s ranking consistency surpasses that of EMMS, LogM
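The SHA strategy mentioned in this review can be illustrated with a minimal sketch, assuming SHA refers to the successive halving algorithm. This is not the authors' implementation: the function name, the `evaluate` callback, the `eta` reduction factor, and the toy scores are all hypothetical, and `initial_ratio` and `max_ratio` only borrow the terms that appear in the reviews. The sketch trains every surviving candidate on a growing fraction of the data and discards the lower-scoring part each round.

```python
import math
from typing import Callable, Dict, List


def successive_halving(
    candidates: List[str],
    evaluate: Callable[[str, float], float],
    initial_ratio: float = 0.125,
    max_ratio: float = 1.0,
    eta: int = 2,
) -> str:
    """Return the last surviving candidate.

    `evaluate(candidate, ratio)` is a hypothetical callback that trains a
    candidate on a `ratio` fraction of the data and returns a score
    (higher is better).
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    if eta < 2:
        raise ValueError("eta must be at least 2")

    survivors = list(candidates)
    ratio = initial_ratio
    while len(survivors) > 1:
        # Score every survivor at the current data fraction.
        scores: Dict[str, float] = {c: evaluate(c, ratio) for c in survivors}
        # Keep the top 1/eta of the survivors (at least one).
        keep = max(1, math.ceil(len(survivors) / eta))
        survivors = sorted(survivors, key=lambda c: scores[c], reverse=True)[:keep]
        # Give the remaining candidates a larger data fraction next round.
        ratio = min(max_ratio, ratio * eta)
    return survivors[0]


# Toy usage with made-up scores; real scores would come from training runs.
toy_quality = {
    "enc_a+llm_x": 0.61,
    "enc_a+llm_y": 0.72,
    "enc_b+llm_x": 0.55,
    "enc_b+llm_y": 0.68,
}


def toy_evaluate(candidate: str, ratio: float) -> float:
    # Placeholder: the score grows with the data fraction, ranking is fixed.
    return toy_quality[candidate] * math.sqrt(ratio)


best = successive_halving(list(toy_quality), toy_evaluate)
print(best)
```

In this toy setup the ranking never changes between rounds, so the sketch only shows the control flow. In practice, early scores are noisy, which is presumably why the reviews also discuss extrapolation and early-stopping logic.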

Weaknesses

The experimental setup and implementation in the paper are not fully aligned; it should be clarified whether LoRA fine-tuning is enabled by default, and performance and resource overhead should be reported separately for both modes. There appear to be typos in the extrapolation and early-stopping logic of Algorithm 1, which may require correction. The terms “maximum sample ratio” and “initial ratio” should also be made consistent. The baseline comparison alters the “training-free” assumption, so

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- This paper addresses a real problem faced by practitioners: selecting optimal pretrained components for VLMs is currently ad hoc and computationally expensive.
- The authors conducted extensive experiments across 7 datasets and 49 model combinations, with detailed ablation studies demonstrating the contribution of each component.
- The proposed method achieves an 8.9×–11.6× speedup over grid search while maintaining strong performance.

Weaknesses

- The evaluation focuses on 7B parameter models with MLP projectors. As acknowledged, extending to smaller (1B) or larger (70B) models presents challenges. The framework's effectiveness on other projector architectures (e.g., Q-former) is unexplored.
- The performance is sensitive to clustering thresholds ($t_{ve}$, $t_{llm}$), which may hurt its generalization to other unexplored LLM and vision encoders.
- The current design optimizes for single tasks independently. It would be better if the

Reviewer 03 · Rating 6 · Confidence 1

Strengths

1. Mordal addresses a practical problem: manual VLM selection is unreliable and grid search is infeasible.
2. Mordal uses an appropriate metric to cluster VLM combinations.
3. The experiments are extensive and the results are strong.
4. Mordal requires no manual intervention.

Weaknesses

1. Limited scope and generalizability. Mordal is architecture-specific and is only evaluated on VLMs with MLP projectors. It is unclear whether it extends to Q-former or other architectures.
2. Only 7 vision encoders and 7 LLMs are tested. It is unclear how it will perform when scaling to a 10× larger model zoo.

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling