Multimodal Lego: Model Merging and Fine-Tuning Across Topologies and Modalities in Biomedicine
Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik

TL;DR
Multimodal Lego (MM-Lego) is a framework for merging and fine-tuning diverse unimodal encoders into multimodal models without extensive retraining, aimed particularly at biomedical applications.
Contribution
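Introduces MM-Lego, a general-purpose fusion framework that turns unimodal encoders into a multimodal model with no or minimal fine-tuning. It targets limitations of existing fusion methods: reliance on end-to-end training, quadratic scaling in the number of modalities, poor handling of modality imbalance, and topology-specific designs.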
Findings
Achieves competitive performance without end-to-end training.
Operates on any unimodal encoder.
Outperforms the benchmark methods on five of seven datasets.
Abstract
Learning holistic computational representations in physical, chemical or biological systems requires the ability to process information from different distributions and modalities within the same model. Thus, the demand for multimodal machine learning models has sharply risen for modalities that go beyond vision and language, such as sequences, graphs, time series, or tabular data. While there are many available multimodal fusion and alignment approaches, most of them require end-to-end training, scale quadratically with the number of modalities, cannot handle cases of high modality imbalance in the training set, or are highly topology-specific, making them too restrictive for many biomedical learning tasks. This paper presents Multimodal Lego (MM-Lego), a general-purpose fusion framework to turn any set of encoders into a competitive multimodal model with no or minimal fine-tuning. We…
Taxonomy
Topics: Speech and dialogue systems
Methods: Sparse Evolutionary Training
