ARMADA: Attribute-Based Multimodal Data Augmentation

Xiaomeng Jin; Jeonghwan Kim; Yu Zhou; Kuan-Hao Huang; Te-Lin Wu; Nanyun Peng; Heng Ji

arXiv:2408.10086·cs.AI·August 20, 2024

TL;DR

ARMADA is a novel multimodal data augmentation framework that uses knowledge-guided manipulation of visual attributes to generate high-quality, semantically consistent image-text pairs, improving model performance across multiple tasks.

Contribution

The paper introduces ARMADA, a new method that leverages knowledge bases and large language models to generate semantically consistent and diverse multimodal data for training.

Findings

01 Improves data quality for multimodal models

02 Enhances downstream task performance

03 Leverages external knowledge for realistic data generation

Abstract

In Multimodal Language Models (MLMs), the cost of manually annotating high-quality image-text pair data for fine-tuning and alignment is extremely high. While existing multimodal data augmentation frameworks propose ways to augment image-text pairs, they either suffer from semantic inconsistency between texts and images, or generate unrealistic images, creating a knowledge gap with real-world examples. To address these issues, we propose Attribute-based Multimodal Data Augmentation (ARMADA), a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes of the mentioned entities. Specifically, we extract entities and their visual attributes from the original text data, then search for alternative values for the visual attributes under the guidance of knowledge bases (KBs) and large language models (LLMs). We then utilize an image-editing model to edit…
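The attribute-substitution idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy knowledge base, the `Edit` type, and the functions `find_edits`, `apply_text_edit`, and `edit_instruction` are all hypothetical names invented here, the entity and attribute matching is a naive string lookup standing in for real extraction, and the image-editing step is reduced to producing an instruction string rather than calling any model.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge base: entity -> attribute -> plausible values.
# A real system would consult an actual KB or an LLM; these entries are
# illustrative assumptions only.
TOY_KB = {
    "apple": {"color": ["red", "green", "yellow"]},
}


@dataclass(frozen=True)
class Edit:
    entity: str
    attribute: str
    old_value: str
    new_value: str


def find_edits(caption: str, kb=TOY_KB) -> list[Edit]:
    """Propose one edit per alternative value of each attribute found in the caption."""
    text = caption.lower()
    words = text.split()  # naive whitespace tokenization (assumption)
    edits = []
    for entity, attributes in kb.items():
        if entity not in text:
            continue
        for attribute, values in attributes.items():
            # The current value is the first KB value that appears as a word.
            current = next((v for v in values if v in words), None)
            if current is None:
                continue
            for value in values:
                if value != current:
                    edits.append(Edit(entity, attribute, current, value))
    return edits


def apply_text_edit(caption: str, edit: Edit) -> str:
    """Rewrite the caption so it matches the edited attribute value."""
    return caption.replace(edit.old_value, edit.new_value, 1)


def edit_instruction(edit: Edit) -> str:
    """Build a text instruction that an image-editing model could be given."""
    return (
        f"change the {edit.attribute} of the {edit.entity} "
        f"from {edit.old_value} to {edit.new_value}"
    )


if __name__ == "__main__":
    caption = "A red apple on a wooden table."
    for edit in find_edits(caption):
        print(apply_text_edit(caption, edit), "|", edit_instruction(edit))
```

Keeping the caption rewrite and the image instruction tied to the same `Edit` record is one simple way to keep the augmented text and image consistent with each other, which is the property the abstract emphasizes. The sketch only handles lowercase, single-word attribute values.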
