(Almost) Free Modality Stitching of Foundation Models

Jaisidh Singh; Diganta Misra; Boris Knyazev; Antonio Orvieto

arXiv:2507.10015·cs.CV·July 18, 2025

TL;DR

This paper introduces Hyma, a hypernetwork-based framework that cuts the computational cost of selecting and aligning uni-modal models when building multi-modal foundation models, while performing close to grid search.

Contribution

Hyma provides an efficient, all-in-one hypernetwork approach that performs uni-modal model selection and connector training simultaneously, reducing search cost tenfold.

Findings

01

Hyma reduces search cost by 10x.

02

Hyma closely matches grid search performance.

03

Hyma is effective across diverse benchmarks.

Abstract

Foundation multi-modal models are often designed by stitching together multiple existing pretrained uni-modal models: for example, an image classifier with a text model. This stitching is performed by training a connector module that aligns the representation spaces of the uni-modal models towards a multi-modal objective. However, given the complexity of training such connectors on large-scale web-based datasets, coupled with the ever-increasing number of available pretrained uni-modal models, uni-modal model selection and the subsequent connector training become computationally demanding. To address this critical but under-studied problem, we propose Hypernetwork Model Alignment (Hyma), a novel all-in-one solution for optimal uni-modal model selection and connector training by leveraging hypernetworks. Specifically, our framework utilizes the parameter…
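To make the idea of a hypernetwork that serves many model pairs concrete, here is a minimal sketch of one plausible setup, not the paper's actual architecture (which the truncated abstract does not specify). It assumes a linear connector, a learnable embedding per (image model, text model) pair, and a single shared linear hypernetwork; all names and sizes are hypothetical, and no training loop is shown.

```python
import random

random.seed(0)

# Hypothetical sizes: 3 candidate image encoders, 2 candidate text encoders.
N_IMAGE, N_TEXT = 3, 2
# Assumption: every image feature has D_IN dims and every text space has D_OUT dims.
D_IN, D_OUT, D_EMB = 4, 3, 5


def rand_matrix(rows, cols, scale=0.1):
    """Return a rows x cols matrix of small Gaussian values."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# One embedding per (image model, text model) pair; these would be learned.
pair_embeddings = {
    (i, j): [random.gauss(0.0, 1.0) for _ in range(D_EMB)]
    for i in range(N_IMAGE)
    for j in range(N_TEXT)
}

# Shared hypernetwork: a linear map from a pair embedding to the
# flattened weights of that pair's connector.
hyper_weights = rand_matrix(D_OUT * D_IN, D_EMB)


def generate_connector(i, j):
    """Produce the D_OUT x D_IN connector matrix for model pair (i, j)."""
    flat = matvec(hyper_weights, pair_embeddings[(i, j)])
    return [flat[r * D_IN:(r + 1) * D_IN] for r in range(D_OUT)]


def apply_connector(connector, feature):
    """Project an image feature into the text model's space."""
    return matvec(connector, feature)


# One shared set of hypernetwork weights yields a connector for every pair,
# whereas grid search would train N_IMAGE * N_TEXT connectors separately.
connectors = {pair: generate_connector(*pair) for pair in pair_embeddings}

image_feature = [1.0, 0.5, -0.5, 2.0]  # made-up feature from image model 0
projected = apply_connector(connectors[(0, 1)], image_feature)
```

In this sketch the only parameters are `hyper_weights` and `pair_embeddings`, so a single training run could in principle update the connectors for all six pairs at once; whether Hyma shares parameters in this way is an assumption of the example.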
