Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models
Young Kyun Jang, Ser-nam Lim

TL;DR
This paper introduces a novel cross-modal backward-compatible training method for vision-language models, enabling efficient model upgrades without re-computing large datasets, by using a projection module pretrained with text data.
Contribution
It extends vision-only backward-compatible training to cross-modal retrieval, proposing a projection module that aligns new and old model embeddings efficiently without using the old model during training.
Findings
Effective cross-modal backward compatibility achieved.
Reduces data and computational requirements for model upgrades.
Enables backfill-free upgrades for vision-language models.
Abstract
Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely…
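The core idea in the abstract, mapping new-model embeddings into the old model's space so that a stored gallery need not be backfilled, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the embeddings are synthetic, the relation between the two spaces (`to_old_space`) is a made-up signed permutation, the projection is a plain linear map fit by gradient descent on paired text embeddings, and all names are hypothetical. It is not the paper's projection module or training recipe, which the truncated abstract does not fully specify.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def fit_projection(new_embs, old_embs, dim, steps=1000, lr=0.1):
    """Fit a linear map W (new space -> old space) by gradient descent
    on the mean squared error over paired embeddings."""
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    n = len(new_embs)
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(dim)]
        for x, y in zip(new_embs, old_embs):
            pred = matvec(W, x)
            for i in range(dim):
                err = pred[i] - y[i]
                for j in range(dim):
                    grad[i][j] += err * x[j] / n
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * grad[i][j]
    return W


def retrieve(query, gallery):
    """Return the index of the gallery item most similar to the query."""
    return max(range(len(gallery)), key=lambda k: cosine(query, gallery[k]))


def to_old_space(v):
    """Hypothetical stand-in for how the old model's space relates to the new one."""
    return [v[1], -v[0], v[3], v[2]]


DIM = 4
rng = random.Random(0)

# Text-only fitting data: the same texts embedded in both spaces (synthetic).
text_new = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(64)]
text_old = [to_old_space(v) for v in text_new]
W = fit_projection(text_new, text_old, DIM)

# Gallery embedded once in the old space and never re-computed (no backfilling).
query_new = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(10)]
gallery_old = [to_old_space(v) for v in query_new]

# Count queries that retrieve their matching gallery item.
hits_raw = sum(retrieve(q, gallery_old) == i for i, q in enumerate(query_new))
hits_proj = sum(retrieve(matvec(W, q), gallery_old) == i
                for i, q in enumerate(query_new))
print("matches without projection:", hits_raw)
print("matches with projection:", hits_proj)
```

In this toy setting the projection is fit on text embeddings only and then reused for queries against the old gallery, which loosely mirrors the "pretrained with text data" description in the TL;DR. A real system would replace the linear map and synthetic vectors with learned modules and actual encoder outputs.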
Peer Reviews
Decision · Submitted to ICLR 2025
- The motivation is meaningful. The paper seeks to avoid the time-consuming updating calculations on the embeddings of a cross-modal retrieval system when a new, improved model is introduced.
- The paper aims to develop a backfill-free cross-modal retrieval system and extends Backward-compatible Training from the vision domain to the cross-modal domain, potentially opening a new research direction.
- The effectiveness of the proposed method is supported by the experimental results.
- Poor formulation of the paper: There are incorrect citations in the paper. For example, in Lines 116 and 117, “(BT) was first introduced in the study Shen et al. (2020)” should be “(BT) was first introduced in the study (Shen et al. 2020).” Additionally, in Fig. 2, the captions refer to the wrong figures: "text-only pretraining" is on the left side, while XBT is on the right side, but the paper states "above" and "below."
- The backward-compatible representation seems counterintuitive. If the
1. The concept of XBT is innovative and addresses a significant practical issue in the deployment of new models in real-world applications. The idea of using a text-only pretrained projection module to align embeddings is interesting and shows potential for efficiency gains.
2. The methodology is well-explained and technically sound. The authors have provided a clear outline of the training process, including the text-only pretraining and the cross-modal backward-compatible training stages.
3. T
1. While the paper claims that XBT reduces the need for image-text pairs, the scalability of the approach in very large-scale systems with real-world data distributions is not fully explored. The authors only used a source dataset from BLIP (which is a synthetic dataset); a subset from LAION400M would be beneficial for this work.
2. The paper briefly mentions the impact of data quality on XBT performance. A more detailed analysis of how noisy or biased training data might affect the embedding align
1. Comprehensive Experimentation: The paper demonstrates strong empirical validation through extensive experiments across multiple models and datasets. This thorough evaluation effectively showcases the method's generalization capability and robustness across different scenarios.
2. Novel Perspective: The authors present an innovative approach by proposing that aligning a single modality can potentially lead to cross-modal alignment. This insight offers a fresh perspective on handling multimodal
1. Some citation formats are incorrect.
2. Most performance tables are confusing, making it difficult to clearly understand the performance of both old and new retrieval systems.
3. Lack of several analytical experiments, which are detailed in questions.
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
