Omni-C: Compressing Heterogeneous Modalities into a Single Dense Encoder
Kin Wai Lau, Yasar Abbas Ur Rehman, Lai-Man Po, Pedro Porto Buarque de Gusmão

TL;DR
Omni-C introduces a single dense Transformer encoder that effectively learns shared representations across images, audio, and text, reducing complexity and memory usage while maintaining competitive performance in multimodal tasks.
Contribution
The paper presents Omni-C, a unified dense encoder that replaces expert-based models, enabling efficient multimodal learning without routing or large parameter overhead.
Findings
Achieves comparable performance to expert models in unimodal and cross-modal tasks.
Reduces inference memory usage significantly compared to multi-encoder baselines.
Incurs only modest zero-shot degradation on the audio and text modalities.
Abstract
Recent multimodal systems often rely on separate expert modality encoders, which cause complexity and computational overhead to scale linearly with each added modality. While unified Omni-models address this via Mixture-of-Experts (MoE) architectures with specialized experts and routing, they still inflate parameter counts and introduce routing overhead. In this paper, we propose Omni-C (Omni-Compress), a single dense Transformer-based encoder that learns competitive shared representations across heterogeneous modalities (images, audio, and text) through unimodal contrastive pretraining on large-scale unaligned data. By maximizing parameter sharing in the backbone and using lightweight modality-specific projection heads, Omni-C effectively mitigates inter-modality conflicts without requiring MoE, paired supervision, or routing. This design supports efficient deployment on memory-constrained…
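The design described in the abstract (one shared backbone, a lightweight projection head per modality, and a contrastive objective computed within each modality) can be illustrated with a toy sketch. This is not the authors' implementation: the single linear layer standing in for the dense Transformer, the dimensions, the Gaussian-noise augmentation, the temperature, and every function name are assumptions made for illustration, and inputs are assumed to be already converted to fixed-size feature vectors.

```python
import math
import random

random.seed(0)

# Hypothetical sizes chosen for illustration; none come from the paper.
DIM_IN, DIM_HID, DIM_OUT = 8, 16, 4
MODALITIES = ("image", "audio", "text")


def rand_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


# One backbone shared by every modality (toy stand-in for the dense Transformer).
shared_backbone = rand_matrix(DIM_HID, DIM_IN)
# One lightweight projection head per modality.
heads = {m: rand_matrix(DIM_OUT, DIM_HID) for m in MODALITIES}


def encode(features, modality):
    """Shared backbone -> tanh -> modality-specific head -> unit-norm embedding."""
    hidden = [math.tanh(h) for h in matvec(shared_backbone, features)]
    return l2_normalize(matvec(heads[modality], hidden))


def info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE loss: view_a[i] should match view_b[i] and no other view_b[j]."""
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [sum(a * b for a, b in zip(anchor, other)) / temperature
                  for other in view_b]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(view_a)


def unimodal_loss(batch, modality, noise=0.05):
    """Contrast two noisy views of the same samples; no cross-modal pairs are used."""
    def augment(x):
        return [v + random.gauss(0.0, noise) for v in x]
    view_a = [encode(augment(x), modality) for x in batch]
    view_b = [encode(augment(x), modality) for x in batch]
    return info_nce(view_a, view_b)


# Each modality gets its own batch and its own loss, but all share the backbone.
batch = [[random.gauss(0.0, 1.0) for _ in range(DIM_IN)] for _ in range(4)]
losses = {m: unimodal_loss(batch, m) for m in MODALITIES}
```

In this sketch the only per-modality parameters are the entries of `heads`, so adding a modality adds one small matrix rather than a full encoder, which is the kind of parameter sharing the abstract refers to.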
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Speech and Audio Processing
