A Step Toward Federated Pretraining of Multimodal Large Language Models

Baochen Xiong; Yifan Xu; Xiaoshan Yang; Yaguang Song; Yaowei Wang; Changsheng Xu

arXiv:2603.26786·cs.LG·March 31, 2026

TL;DR

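This paper introduces Fed-CMP, a federated pre-training framework for multimodal large language models that targets two challenges in privacy-preserving collaborative training: parameter interference when aggregating local projectors, and gradient oscillations in one-pass collaborative SGD.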

Contribution

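It formalizes a lightweight federated pre-training paradigm, Federated MLLM Alignment (Fed-MA), in which the vision encoder and LLM stay frozen and only the cross-modal projector is trained collaboratively, and it proposes canonical reliability-aware aggregation and orthogonality-preserved momentum to improve multimodal alignment.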

Findings

01. Fed-CMP outperforms existing baselines in federated pre-training scenarios.

02. The canonical reliability-aware aggregation effectively reduces parameter interference.

03. Orthogonality-preserved momentum maintains geometric structure during training.

Abstract

The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework…
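The Fed-MA setting described in the abstract can be illustrated with a minimal sketch. It assumes a single linear projector, a squared-error objective, and plain sample-weighted averaging (FedAvg) on the server; none of these choices come from the paper, and the averaging step is a stand-in baseline, not Fed-CMP's canonical reliability-aware aggregation. The names `local_step` and `weighted_average` are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def local_step(W: Matrix, x: Sequence[float], target: Sequence[float], lr: float) -> Matrix:
    """One SGD step on 0.5 * ||W x - target||^2 for a linear projector.

    Assumption: x stands in for a frozen vision-encoder feature and target
    for a frozen LLM embedding; only the projector W is updated.
    """
    pred = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]
    err = [p - t for p, t in zip(pred, target)]
    # Gradient of the loss with respect to W[i][j] is err[i] * x[j].
    return [[W[i][j] - lr * err[i] * x[j] for j in range(len(x))] for i in range(len(W))]


def weighted_average(projectors: Sequence[Matrix], num_samples: Sequence[int]) -> Matrix:
    """Sample-weighted element-wise average of client projectors (FedAvg-style).

    Assumption: this is a generic baseline, not the paper's aggregation rule.
    """
    if not projectors or len(projectors) != len(num_samples):
        raise ValueError("need one sample count per client projector")
    total = sum(num_samples)
    rows, cols = len(projectors[0]), len(projectors[0][0])
    avg = [[0.0] * cols for _ in range(rows)]
    for W, n in zip(projectors, num_samples):
        for i in range(rows):
            for j in range(cols):
                avg[i][j] += (n / total) * W[i][j]
    return avg


if __name__ == "__main__":
    global_W = [[0.0, 0.0]]
    # Each client takes one local step on its own private (feature, target) pair.
    client_a = local_step(global_W, x=[1.0, 2.0], target=[1.0], lr=0.5)
    client_b = local_step(global_W, x=[2.0, 0.0], target=[2.0], lr=0.5)
    # The server only ever sees projector weights, never the raw data.
    global_W = weighted_average([client_a, client_b], num_samples=[1, 3])
    print(global_W)
```

Only the projector weights cross the network in this sketch, which is what makes the paradigm lightweight: the frozen encoder and LLM never need to be transmitted or aggregated.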
