Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Abdelrahman Shaker; Ahmed Heakl; Jaseel Muhammad; Ritesh Thawkar; Omkar Thawakar; Senmao Li; Hisham Cholakkal; Ian Reid; Eric P. Xing; Salman Khan; and Fahad Shahbaz Khan

arXiv:2602.20161 · cs.CV · February 25, 2026

Open Access · 3 Models · 3 Datasets

TL;DR

Mobile-O is a compact, efficient multimodal model enabling real-time visual understanding and generation on mobile devices, achieving high performance with minimal computational resources.

Contribution

The paper introduces Mobile-O, a lightweight vision-language-diffusion model built around a novel Mobile Conditioning Projector (MCP) module, which enables unified multimodal understanding and generation on edge devices with high efficiency.

Findings

01. Achieves 74% on GenEval, outperforming competitors by 5-11%.

02. Runs in ~3 seconds per image on an iPhone, enabling real-time processing.

03. Outperforms existing models in visual understanding benchmarks by 5-15%.

Abstract

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other…
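The abstract attributes the MCP's low cost in part to depthwise-separable convolutions. The sketch below only illustrates the general arithmetic behind that choice; it is not the authors' implementation. It counts weights for a standard k x k convolution and for its depthwise-separable counterpart (a per-channel k x k depthwise convolution followed by a 1 x 1 pointwise convolution). The function names and the channel and kernel sizes are hypothetical and chosen for illustration; the paper's actual layer dimensions are not given in this summary.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a depthwise-separable convolution (bias omitted).

    depthwise: one k x k filter per input channel
    pointwise: a 1 x 1 convolution that mixes channels
    """
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


def reduction_ratio(c_in: int, c_out: int, k: int) -> float:
    """Separable / standard weight ratio; algebraically 1/c_out + 1/k**2."""
    return depthwise_separable_params(c_in, c_out, k) / standard_conv_params(c_in, c_out, k)


if __name__ == "__main__":
    # Hypothetical sizes, NOT taken from the paper.
    c_in, c_out, k = 256, 256, 3
    print("standard conv weights: ", standard_conv_params(c_in, c_out, k))
    print("separable conv weights:", depthwise_separable_params(c_in, c_out, k))
    print("ratio: %.4f" % reduction_ratio(c_in, c_out, k))
```

With these illustrative sizes the separable layer needs roughly 11.5% of the weights of the standard layer. The same ratio applies to multiply-accumulate operations at a fixed output resolution, which is why this factorization is common in mobile-oriented architectures.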

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning