CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning

Qingqing Cao; Mahyar Najibi; Sachin Mehta

arXiv:2410.11963 · cs.CV · October 17, 2024

Open Access

TL;DR

CtrlSynth introduces a controllable, modular image-text synthesis pipeline that enhances data diversity and robustness in multimodal learning by decomposing and recomposing visual semantics with user-defined policies, leveraging pretrained models.

Contribution

It presents a novel, training-free framework for fine-grained control over synthetic data generation in multimodal tasks, improving model performance across multiple datasets.

Findings

01. Significantly improves zero-shot classification accuracy.
02. Enhances image-text retrieval performance.
03. Boosts compositional reasoning capabilities.

Abstract

Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous works have shown promising results in augmenting datasets by generating synthetic samples. However, they only support domain-specific ad hoc use cases (e.g., either image or text only, but not both), and are limited in data diversity due to a lack of fine-grained control over the synthesis process. In this paper, we design a controllable image-text synthesis pipeline, CtrlSynth, for data-efficient and robust multimodal learning. The key idea is to decompose the visual semantics of an image into basic elements, apply user-specified control policies (e.g., remove, add, or replace operations), and recompose them to synthesize images or texts. The decompose and recompose feature in CtrlSynth allows…
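
As a rough illustration of the decompose, control, and recompose idea described in the abstract, the following Python sketch applies remove, add, and replace policies to a list of visual tags and turns the result into a text instruction. It is not the authors' implementation: the tag list is written by hand instead of being extracted by a pretrained model, and the names `remove`, `add`, `replace`, `apply_policies`, and `recompose_prompt` are hypothetical.

```python
from typing import Callable, List

# Assumption: the "basic elements" of an image are modeled as a flat list of
# string tags. The paper obtains them with pretrained models; here they are
# hand-written for illustration only.
Tags = List[str]
Policy = Callable[[Tags], Tags]


def remove(tag: str) -> Policy:
    """Hypothetical policy: drop every occurrence of `tag`."""
    return lambda tags: [t for t in tags if t != tag]


def add(tag: str) -> Policy:
    """Hypothetical policy: append `tag` if it is not already present."""
    return lambda tags: tags if tag in tags else tags + [tag]


def replace(old: str, new: str) -> Policy:
    """Hypothetical policy: substitute `new` for every occurrence of `old`."""
    return lambda tags: [new if t == old else t for t in tags]


def apply_policies(tags: Tags, policies: List[Policy]) -> Tags:
    """Apply the user-specified policies in order, leaving the input untouched."""
    edited = list(tags)
    for policy in policies:
        edited = policy(edited)
    return edited


def recompose_prompt(tags: Tags) -> str:
    """Stand-in for recomposition: build an instruction that a text or image
    generator could consume. The wording of the prompt is an assumption."""
    return "Write a caption for an image containing: " + ", ".join(tags)


# Decomposed elements of a hypothetical image.
tags = ["dog", "frisbee", "grass"]
edited = apply_policies(
    tags, [remove("frisbee"), add("ball"), replace("grass", "beach")]
)
prompt = recompose_prompt(edited)
print(edited)
print(prompt)
```

Modeling each policy as a plain function from tags to tags keeps the control step independent of whichever models perform decomposition and synthesis, which matches the modular framing in the summary above.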

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Handwritten Text Recognition Techniques

Methods: Diffusion · Contrastive Language-Image Pre-training · High-Order Consensuses