Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder

Kin Wai Lau; Yasar Abbas Ur Rehman; Lai-Man Po; Pedro Porto Buarque de Gusmão

arXiv:2603.05528 · cs.MM · March 9, 2026

TL;DR

Omni-C introduces a single dense Transformer encoder that effectively learns shared representations across images, audio, and text, reducing complexity and memory usage while maintaining competitive performance in multimodal tasks.

Contribution

The paper presents Omni-C, a unified dense encoder that replaces expert-based models, enabling efficient multimodal learning without routing or large parameter overhead.
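To make the parameter-overhead argument concrete, here is a rough back-of-the-envelope sketch comparing one full encoder per modality against a single shared backbone with small per-modality projection heads. The layer count, widths, and head size are hypothetical placeholders chosen for illustration, not figures from the paper, and the count ignores biases, normalization layers, and input embeddings.

```python
# Illustrative parameter-count comparison. Every size below is a hypothetical
# placeholder, not a figure reported in the paper.

def transformer_params(layers: int, d_model: int, d_ff: int) -> int:
    """Rough weight count for a stack of Transformer encoder layers.

    Counts the four attention projections (Q, K, V, output) and the two
    feed-forward matrices per layer; biases, norms, and embeddings are ignored.
    """
    attention = 4 * d_model * d_model
    feed_forward = 2 * d_model * d_ff
    return layers * (attention + feed_forward)


def projection_head_params(d_model: int, d_out: int) -> int:
    """Weight count for a single linear projection head (bias ignored)."""
    return d_model * d_out


MODALITIES = ["image", "audio", "text"]
LAYERS, D_MODEL, D_FF, D_OUT = 12, 768, 3072, 256  # assumed sizes

backbone = transformer_params(LAYERS, D_MODEL, D_FF)
head = projection_head_params(D_MODEL, D_OUT)

# One full encoder per modality: cost grows linearly with the modality count.
separate_total = len(MODALITIES) * (backbone + head)

# One shared backbone plus a lightweight head per modality.
shared_total = backbone + len(MODALITIES) * head

print(f"separate encoders: {separate_total:,}")
print(f"shared backbone:   {shared_total:,}")
print(f"ratio:             {separate_total / shared_total:.2f}x")
```

Under these assumptions the heads are a small fraction of the backbone, so adding a modality to the shared design costs one extra head rather than one extra encoder.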

Findings

01 Achieves performance comparable to expert models on unimodal and cross-modal tasks.

02 Reduces inference memory usage significantly compared to multi-encoder baselines.

03 Shows only modest zero-shot degradation on the audio and text modalities.

Abstract

Recent multimodal systems often rely on separate expert modality encoders, which causes complexity and computational overhead to scale linearly with each added modality. While unified Omni-models address this via Mixture-of-Experts (MoE) architectures with specialized experts and routing, they still inflate parameter counts and introduce routing overhead. In this paper, we propose Omni-C (Omni-Compress), a single dense Transformer-based encoder that learns competitive shared representations across heterogeneous modalities (images, audio, and text) through unimodal contrastive pretraining on large-scale unaligned data. By maximizing parameter sharing in the backbone and using lightweight modality-specific projection heads, Omni-C effectively mitigates inter-modality conflicts without requiring MoE, paired supervision, or routing. This design supports efficient deployment on memory-constrained…
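The abstract does not say which contrastive objective is used. As one common choice, the sketch below implements an InfoNCE-style loss in plain Python, assuming that two augmented views of the same unimodal sample form a positive pair and the other samples in the batch act as negatives. The function names, the temperature value, and the toy embeddings are all illustrative assumptions; the vectors stand in for the outputs of a modality-specific projection head.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    view_a[i] and view_b[i] are embeddings of two views of the same sample
    (the positive pair); every other view_b[j] serves as a negative for
    view_a[i]. The temperature default is an assumed value.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings: identical views should score a much lower loss than
# views whose pairing has been shifted by one position.
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = info_nce(embeddings, embeddings)
misaligned_loss = info_nce(embeddings, embeddings[1:] + embeddings[:1])
print(aligned_loss, misaligned_loss)
```

Because the loss only needs two views of the same sample, it requires no paired data across modalities, which matches the abstract's description of pretraining on unaligned data.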

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Speech and Audio Processing