Multimodal Large Language Models for Image, Text, and Speech Data Augmentation: A Survey

Ranjan Sapkota; Shaina Raza; Maged Shoman; Achyut Paudel; Manoj Karkee

arXiv:2501.18648 · cs.CV · March 25, 2025 · 2 cites

Open Access · 1 Repo

TL;DR

This survey reviews recent advancements in multimodal large language models used for data augmentation across image, text, and speech modalities, highlighting methods, limitations, and future directions.

Contribution

It provides a comprehensive overview of multimodal LLM-based data augmentation techniques and discusses current limitations and potential solutions in this emerging field.

Findings

01 Identifies various multimodal LLM augmentation methods for images, text, and speech.

02 Discusses limitations in current approaches and proposes potential solutions.

03 Serves as a foundation for future research in multimodal data augmentation.

Abstract

In the past five years, research has shifted from traditional Machine Learning (ML) and Deep Learning (DL) approaches to leveraging Large Language Models (LLMs), including multimodal ones, for data augmentation to enhance generalization and combat overfitting in training deep convolutional neural networks. However, existing surveys predominantly focus on ML and DL techniques or on limited modalities (text or images), so a gap remains in addressing the latest advancements and multimodal applications of LLM-based methods. This survey fills that gap by exploring recent literature that uses multimodal LLMs to augment image, text, and audio data, offering a comprehensive understanding of these processes. We outline various methods employed in LLM-based image, text, and speech augmentation, and discuss the limitations identified in current approaches. Additionally, we identified…

Code & Models

Repositories

wsuagrobotics/data-aug-multi-modal-llm
none · Official

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: Focus