ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

Hanpeng Liu; Yaqian Li; Zidan Wang; Shuoxi Zhang; Zonglin Zhao; Zihao Bo; Rinyoichi Takezoe; Kaiwen Long; Kun He

arXiv:2603.02767 · cs.CV · March 10, 2026

Open Access

TL;DR

ITO combines multimodal multiple alignment with a training-time fusion module that is discarded at inference. The result is image-text representations with a smaller modality gap and better performance on classification, retrieval, and multimodal benchmarks.

Contribution

The paper presents ITO, a method that pairs multimodal multiple alignment with a lightweight fusion module used only during training, improving cross-modal learning without adding inference cost.

Findings

01. Outperforms strong baselines in classification, retrieval, and multimodal tasks.

02. Training-time fusion acts as a structural regularizer, stabilizing training.

03. Multiple alignment enhances discriminative power of representations.

Abstract

Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer --…
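The two mechanisms named in the abstract can be pictured with a small NumPy sketch. The abstract does not say how correspondences are mined or how the fusion module is built, so everything here is an assumption made for illustration: the `positives` matrix stands in for mined image-text matches, the loss is a generic symmetric contrastive loss with soft targets over multiple positives, and `ToyFusionHead` is a hypothetical linear head rather than the paper's module. The one property taken from the source is that retrieval at inference uses only the two encoders' embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def multi_positive_contrastive_loss(img, txt, positives, temperature=0.07):
    """Symmetric contrastive loss with several positives per sample.

    positives[i, j] = 1 marks image i and text j as a match (assumed to come
    from some mining step). Every row and column needs at least one positive.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature
    # Spread the target probability evenly over each sample's positives.
    i2t_target = positives / positives.sum(axis=1, keepdims=True)
    t2i_target = positives.T / positives.T.sum(axis=1, keepdims=True)
    loss_i2t = -(i2t_target * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2i = -(t2i_target * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


class ToyFusionHead:
    """Hypothetical training-only head scoring concatenated embeddings."""

    def __init__(self, dim, rng):
        self.w = rng.normal(scale=0.1, size=(2 * dim,))

    def match_logit(self, img, txt):
        # One image-text matching logit per aligned (image, text) pair.
        return np.concatenate([img, txt], axis=-1) @ self.w


def retrieve(img, txt):
    """Inference path: cosine similarity between the two encoders only."""
    return (l2_normalize(img) @ l2_normalize(txt).T).argmax(axis=1)
```

In a training loop under these assumptions, the fusion head's matching loss would be added to the contrastive loss, and the head would simply not be loaded at inference, which is why `retrieve` takes no fusion argument.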

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning