ARMADA: Attribute-Based Multimodal Data Augmentation
Xiaomeng Jin, Jeonghwan Kim, Yu Zhou, Kuan-Hao Huang, Te-Lin Wu, Nanyun Peng, Heng Ji

TL;DR
ARMADA is a novel multimodal data augmentation framework that uses knowledge-guided manipulation of visual attributes to generate high-quality, semantically consistent image-text pairs, improving model performance across multiple tasks.
Contribution
The paper introduces ARMADA, a new method that leverages knowledge bases and large language models to generate semantically consistent and diverse multimodal data for training.
Findings
Improves data quality for multimodal models
Enhances downstream task performance
Leverages external knowledge for realistic data generation
Abstract
In Multimodal Language Models (MLMs), the cost of manually annotating high-quality image-text pair data for fine-tuning and alignment is extremely high. While existing multimodal data augmentation frameworks propose ways to augment image-text pairs, they either suffer from semantic inconsistency between texts and images, or generate unrealistic images, causing a knowledge gap with real-world examples. To address these issues, we propose Attribute-based Multimodal Data Augmentation (ARMADA), a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes of the mentioned entities. Specifically, we extract entities and their visual attributes from the original text data, then search for alternative values for the visual attributes under the guidance of knowledge bases (KBs) and large language models (LLMs). We then utilize an image-editing model to edit…
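The text-side steps described in the abstract (extract an entity and its visual attribute, look up alternative attribute values in a knowledge base, rewrite the caption) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `TOY_KB` dictionary, the `Mention` class, the naive token matching, and the format of the edit instruction are hypothetical and are not taken from the paper. The image-editing step is not implemented; the sketch only produces an instruction string that such a model might consume.

```python
from dataclasses import dataclass

# Toy stand-in for a knowledge base: entity -> attribute -> plausible values.
# These entries are illustrative assumptions, not data from the paper.
TOY_KB = {
    "ladybug": {"color": ["red", "orange", "yellow"]},
    "apple": {"color": ["red", "green", "yellow"]},
}


@dataclass(frozen=True)
class Mention:
    """A hypothetical record of one attribute value attached to one entity."""
    entity: str
    attribute: str
    value: str


def extract_mentions(text, kb):
    """Find '<value> <entity>' word pairs (e.g. 'red ladybug') by naive token matching."""
    tokens = text.lower().replace(".", "").split()
    mentions = []
    for prev, cur in zip(tokens, tokens[1:]):
        # kb.get returns {} for words that are not known entities.
        for attribute, values in kb.get(cur, {}).items():
            if prev in values:
                mentions.append(Mention(cur, attribute, prev))
    return mentions


def augment_text(text, kb):
    """Return (new_text, edit_instruction) pairs, one per alternative attribute value."""
    results = []
    for m in extract_mentions(text, kb):
        for alt in kb[m.entity][m.attribute]:
            if alt == m.value:
                continue  # skip the value already present in the caption
            # Case-sensitive replacement; a real system would need proper span handling.
            new_text = text.replace(f"{m.value} {m.entity}", f"{alt} {m.entity}")
            instruction = (
                f"change the {m.attribute} of the {m.entity} "
                f"from {m.value} to {alt}"
            )
            results.append((new_text, instruction))
    return results


pairs = augment_text("The red ladybug sits on a leaf.", TOY_KB)
for new_text, instruction in pairs:
    print(new_text, "|", instruction)
```

Restricting substitutions to values listed in the knowledge base is what keeps the augmented caption plausible in this sketch: the ladybug can become orange or yellow, but never a color the lookup table does not list for that entity.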
