ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
Hanpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He

TL;DR
ITO is a framework that combines multimodal multiple alignment with training-time fusion to improve image-text representations, reducing the modality gap and improving results on classification, retrieval, and multimodal benchmarks.
Contribution
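The paper presents ITO, a method that pairs multimodal multiple alignment with a lightweight fusion module used only during training. Because the fusion module is discarded at inference, cross-modal learning improves without adding inference cost over a standard dual-encoder architecture.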
Findings
Outperforms strong baselines in classification, retrieval, and multimodal tasks.
Training-time fusion acts as a structural regularizer, stabilizing training.
Multiple alignment drives the discriminative power of the representations.
Abstract
Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer --…
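The split between training and inference described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the fusion head here is a hypothetical linear scorer over a concatenated image-text pair, the matching loss and the `fusion_weight` parameter are assumptions, and the multiple-alignment mining step is not modeled at all. The sketch only shows that a fusion term can be added to a symmetric contrastive loss during training while retrieval at inference uses the dual-encoder embeddings alone.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cross_entropy_diag(logits):
    # Mean cross-entropy where the target for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    # Symmetric image-to-text and text-to-image contrastive loss.
    img = [l2_normalize(v) for v in img_embs]
    txt = [l2_normalize(v) for v in txt_embs]
    logits = [[dot(i, t) / temperature for t in txt] for i in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def fusion_loss(img_embs, txt_embs, w):
    # Hypothetical training-only fusion head: a linear score over the
    # concatenated pair, trained to separate matched from shifted pairs.
    n = len(img_embs)
    total = 0.0
    for i in range(n):
        pos = dot(img_embs[i] + txt_embs[i], w)
        neg = dot(img_embs[i] + txt_embs[(i + 1) % n], w)
        total += softplus(-pos) + softplus(neg)
    return total / (2 * n)


def training_loss(img_embs, txt_embs, w, fusion_weight=1.0):
    # Training objective: contrastive term plus the assumed fusion term.
    return contrastive_loss(img_embs, txt_embs) + fusion_weight * fusion_loss(
        img_embs, txt_embs, w
    )


def retrieve(img_emb, txt_embs):
    # Inference: cosine similarity only; the fusion head is not used.
    q = l2_normalize(img_emb)
    sims = [dot(q, l2_normalize(t)) for t in txt_embs]
    return max(range(len(sims)), key=sims.__getitem__)
```

In this sketch `retrieve` never touches the fusion weights `w`, which is the property the abstract highlights: the extra module shapes the embeddings during training and adds nothing to inference cost.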
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
