CACARA: Cross-Modal Alignment Leveraging a Text-Centric Approach for Cost-Effective Multimodal and Multilingual Learning
Diego A. B. Moreira, Alef I. Ferreira, Jhessica Silva, Gabriel O. dos Santos, Gustavo Bonil, João Gondim, Marina dos Santos, Helena Maia, Simone Hashiguti, Nádia da Silva, Carolina Scarton, Helio Pedrini, Sandra Avila

TL;DR
CACARA is a cost-effective multimodal and multilingual model trained with emergent alignment learning. It integrates new modalities and supports over 100 languages without full retraining, and improves R@1 audio-to-text retrieval by up to 14.24%.
Contribution
This work demonstrates that emergent alignment learning enables multimodal and multilingual capabilities from monolingual training, reducing resource requirements and avoiding full retraining.
Findings
Achieves up to 14.24% improvement in R@1 audio-to-text retrieval.
Supports over 100 languages without explicit multilingual pretraining.
Outperforms state-of-the-art models with lower training costs.
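For readers unfamiliar with the metric, R@1 (Recall@1) in retrieval is the fraction of queries whose correct match is ranked first. The sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. The function name `recall_at_k`, the toy scores, and the assumption that query i matches candidate i are illustrative only and are not taken from the paper.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose true match appears among the top-k candidates.

    similarity[i][j] is the score between query i (e.g. an audio clip) and
    candidate j (e.g. a caption). Assumption: the true match for query i is
    candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 audio-to-text similarity matrix (made-up numbers).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct caption ranked first
    [0.2, 0.4, 0.8],  # query 1: correct caption ranked second
    [0.1, 0.2, 0.7],  # query 2: correct caption ranked first
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

With these toy scores, two of the three queries rank their caption first, and all three rank it within the top two.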
Abstract
As deep learning models evolve, new applications and challenges are rapidly emerging. Tasks that once relied on a single modality, such as text, images, or audio, are now enriched by seamless interactions between multimodal data. These connections bridge information gaps: an image can visually materialize a text, while audio can add context to an image. Researchers have developed numerous multimodal models, but most rely on resource-intensive training across multiple modalities. Similarly, extending these models to new languages often follows the same resource-heavy training strategy. In this work, we propose a multimodal and multilingual architecture, CACARA, trained through emergent alignment learning, enabling the seamless integration of new modalities into an existing bimodal/multimodal model without requiring full retraining. This work breaks new ground by demonstrating that this…
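The truncated abstract does not spell out the training objective, so the following is only a generic sketch of how a new modality can be aligned to an existing text embedding space: a CLIP-style symmetric contrastive (InfoNCE) loss over paired audio and text embeddings. Treating the text embeddings as coming from a frozen encoder, the function names, and the temperature value are all assumptions made for illustration, not details from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired (audio, text) embeddings.

    Assumption: text_emb comes from a frozen text encoder, so in training
    only the audio encoder producing audio_emb would receive gradients.
    """
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    n = len(a)
    # logits[i][j]: scaled cosine similarity between audio i and text j.
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean cross-entropy where the correct class for row i is index i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    text_to_audio = [list(col) for col in zip(*logits)]  # transposed logits
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_audio))


# Toy 2-D embeddings (made-up numbers): matched pairs point the same way.
text = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[0.9, 0.1], [0.1, 0.9]]
audio_swapped = [[0.1, 0.9], [0.9, 0.1]]
print(contrastive_loss(audio_aligned, text))
print(contrastive_loss(audio_swapped, text))
```

The loss is lower when each audio embedding points toward its paired text embedding than when the pairs are swapped, which is the signal that would pull a new modality into the shared space.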
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
