Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
Timo Fudala, Vasileios Tsouvalas, Nirvana Meratnia

TL;DR
This paper introduces MPSL, a parallel split learning method enabling efficient, scalable fine-tuning of multimodal transformers on edge devices without sharing labels or requiring client synchronization.
Contribution
MPSL is a novel parallel split learning approach that reduces computation and communication costs for multimodal transformer fine-tuning on resource-constrained devices.
Findings
MPSL matches or outperforms Federated Learning in accuracy.
Client-side computation is reduced by 250x.
MPSL's communication cost scales more favorably than Federated Learning's as model size grows.
Abstract
Multimodal transformers integrate diverse data types like images, audio, and text, advancing tasks such as audio-visual understanding and image-text retrieval; yet their high parameterization limits deployment on resource-constrained edge devices. Split Learning (SL), which partitions models at a designated cut-layer to offload compute-intensive operations to the server, offers a promising approach for distributed training of multimodal transformers, though its application remains underexplored. We present MPSL, a parallel SL approach for computationally efficient fine-tuning of multimodal transformers in a distributed manner, while eliminating label sharing, client synchronization, and per-client sub-model management. MPSL employs lightweight client-side tokenizers and a unified modality-agnostic encoder, allowing flexible adaptation to task-specific needs. Our evaluation across 7…
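Illustrative sketch
The cut-layer idea in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: every "layer" is a single scalar weight, gradients are written out by hand, and the names Server, Client, and run_demo are hypothetical. It also assumes that the prediction head and the loss stay on the client so that labels never leave the device; the abstract states that label sharing is eliminated but does not say how, so this placement is an assumption. Clients are visited in a simple loop here, which only simulates parallel clients that never exchange or average their own weights.

```python
class Server:
    """Holds the shared encoder. It only ever sees activations and gradients."""

    def __init__(self, w_enc=0.5, lr=0.05):
        self.w_enc = w_enc
        self.lr = lr
        self._z = None  # activation cached for the backward pass

    def forward(self, z):
        self._z = z
        return self.w_enc * z

    def backward(self, grad_h):
        # Gradient w.r.t. the client's activation, computed before the update.
        grad_z = grad_h * self.w_enc
        self.w_enc -= self.lr * grad_h * self._z
        return grad_z


class Client:
    """Holds private (x, y) pairs, a lightweight tokenizer, and (assumed) a head."""

    def __init__(self, data, w_tok=1.0, w_head=1.0, lr=0.05):
        self.data = data
        self.w_tok = w_tok
        self.w_head = w_head
        self.lr = lr

    def step(self, server, x, y):
        z = self.w_tok * x            # client-side "tokenizer" up to the cut-layer
        h = server.forward(z)         # server-side encoder
        y_hat = self.w_head * h       # client-side head (assumption)
        err = y_hat - y               # label y is used only on the client
        grad_h = err * self.w_head    # gradient sent to the server
        grad_head = err * h
        grad_z = server.backward(grad_h)
        self.w_head -= self.lr * grad_head
        self.w_tok -= self.lr * grad_z * x
        return 0.5 * err * err


def run_demo(epochs=200):
    """Train on the toy target y = 2x; return (first, last) mean epoch loss."""
    server = Server()
    xs = [0.5, 1.0, 1.5]
    clients = [Client([(x, 2.0 * x) for x in xs]) for _ in range(2)]
    history = []
    for _ in range(epochs):
        losses = [c.step(server, x, y) for c in clients for x, y in c.data]
        history.append(sum(losses) / len(losses))
    return history[0], history[-1]


if __name__ == "__main__":
    first, last = run_demo()
    print(f"mean loss, first epoch: {first:.4f}; last epoch: {last:.6f}")
```

In this sketch the server updates one shared encoder for all clients and keeps no per-client sub-model, while each client keeps only two small weights. That mirrors the division of labor the abstract describes, although real tokenizers and encoders would be neural network modules rather than scalars.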
Taxonomy
Topics: Neural Networks and Applications
