TL;DR
This paper introduces a scalable method for automatically generating high-quality triplet data for composed image retrieval, using synthetic datasets and a novel alignment framework, significantly improving zero-shot and supervised performance.
Contribution
It presents a fully synthetic dataset for CIR and a new hybrid alignment model, enabling scalable training and superior results without manual triplet labeling.
Findings
Achieves state-of-the-art zero-shot performance on CIR benchmarks.
Outperforms existing supervised CIR methods.
Demonstrates the effectiveness of synthetic data for training CIR models.
Abstract
As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and…
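The pipeline described in the abstract (LLM-written prompts, paired text-to-image generation, filtering, reorganization into triplets) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The `Triplet` fields, the (reference caption, modification text, target caption) prompt format, and the `generate_pair` and `keep` callables are all hypothetical stand-ins for the LLM output, the text-to-image model, and the filtering stage, whose details the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One CIR training example (field names are illustrative assumptions)."""
    reference_image: str    # identifier of the reference image
    modification_text: str  # text describing how the target differs
    target_image: str       # identifier of the target image


def build_triplets(
    prompts: List[Tuple[str, str, str]],
    generate_pair: Callable[[str, str], Tuple[str, str]],
    keep: Callable[[str, str, str], bool],
) -> List[Triplet]:
    """Turn prompts into filtered (reference, text, target) triplets.

    prompts: (reference caption, modification text, target caption) tuples,
        assumed here to come from an LLM.
    generate_pair: stand-in for a text-to-image model that renders both
        captions as a pair of images.
    keep: stand-in for the filtering step; returns True to keep a triplet.
    """
    triplets = []
    for ref_caption, mod_text, tgt_caption in prompts:
        ref_img, tgt_img = generate_pair(ref_caption, tgt_caption)
        if keep(ref_img, mod_text, tgt_img):
            triplets.append(Triplet(ref_img, mod_text, tgt_img))
    return triplets


# Toy stand-ins so the sketch runs without any model.
def fake_generate_pair(ref_caption: str, tgt_caption: str) -> Tuple[str, str]:
    # Pretend each caption renders to an image identified by a string.
    return f"img<{ref_caption}>", f"img<{tgt_caption}>"


def fake_keep(ref_img: str, mod_text: str, tgt_img: str) -> bool:
    # Drop pairs whose images are identical or whose edit text is empty.
    return ref_img != tgt_img and bool(mod_text.strip())


demo_prompts = [
    ("a brown dog on a beach", "make the dog black", "a black dog on a beach"),
    ("a red car", "", "a red car"),  # rejected by the toy filter
]
triplets = build_triplets(demo_prompts, fake_generate_pair, fake_keep)
```

Passing the generator and the filter in as callables keeps the sketch independent of any particular model. A real pipeline would replace the two toy functions with model calls and a learned or rule-based quality check.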
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- High scalability and engineering efficiency: This paper constructs a data generation pipeline using state-of-the-art open-source models (LLM, T2I generative model, MLLM). This implies that the dataset can keep growing as long as sufficient GPU resources are available. Furthermore, the data generation process is simpler and more engineering-friendly than previous methods such as CompoDiff, making it highly suitable for adoption in other research efforts aiming
The contributions emphasized in this paper are summarized as follows:
- Proposed a scalable synthesis pipeline to address prior data limitations like **low image quality, unrealistic appearances, and the lack of diversity in captions**, resulting in the large-scale, high-quality CIRHS dataset (534k triplets).
- Introduced a novel CIR framework, **Hybrid Contextual Alignment (CoAlign)**, which enhances learned representations by combining global alignment and local reasoning.
- Experiments demons
1. The paper is well written and easy to follow.
2. Figure illustrations clearly demonstrate the motivation of the paper and the data generation pipeline.
1. The paper introduces a new CIRHS dataset, but no results concerning the CIRHS dataset are provided, making the significance of the dataset and the contribution of the paper unclear.
2. The core idea of global and local alignment in the CoAlign method is not new; more theoretical insights should be provided to strengthen the novelty of the proposed method.
1. The paper is well-written and well-structured, making it easy to follow and understand the main contributions.
2. The proposed generation pipeline is well-motivated, and the method itself is simple yet effective.
3. The experiments are carefully designed to demonstrate the effectiveness of the approach, and the proposed method shows strong and consistent performance across diverse settings.
1. I believe the main decision point of this paper lies in the contribution of the newly generated dataset.
   - As shown in Table 5, there already exist numerous prior works that synthetically construct CIR triplets using LLMs, MLLMs, T2I models, or collection-based approaches. The paper should better highlight the differences between the proposed dataset and these existing methods, perhaps with a comparative summary table. While the results demonstrate the effectiveness of the proposed training d
- This work proposes the CoAlign framework, which considers both global alignment and local reasoning for the CIR task. The proposed approach is simple yet effective.
- This paper demonstrates that, with a high-quality synthetic dataset, it is possible to achieve strong CIR performance in a zero-shot setting. The work conducts a comprehensive evaluation and justifies the effectiveness of the proposed synthetic dataset generation pipeline and CIR framework.
- The paper claims to be the first to demonstrate the feasibility of training CIR models on a fully synthetic dataset; this may be an overclaim. In CompoDiff (Gu et al. 2024a), a synthetic dataset (SynthTriplets18M) is created and used for training the CIR model. In "Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning", ICML 2024, a text-based synthetic data generation approach is applied to existing image-caption pairs for training mult
