Multimodal Large Language Models for Image, Text, and Speech Data Augmentation: A Survey
Ranjan Sapkota, Shaina Raza, Maged Shoman, Achyut Paudel, Manoj Karkee

TL;DR
This survey reviews recent advancements in multimodal large language models used for data augmentation across image, text, and speech modalities, highlighting methods, limitations, and future directions.
Contribution
It provides a comprehensive overview of multimodal LLM-based data augmentation techniques and discusses current limitations and potential solutions in this emerging field.
Findings
Identified various multimodal LLM augmentation methods for images, text, and speech.
Discussed limitations in current approaches and proposed potential solutions.
Serves as a foundation for future research in multimodal data augmentation.
Abstract
In the past five years, research has shifted from traditional Machine Learning (ML) and Deep Learning (DL) approaches to leveraging Large Language Models (LLMs), including multimodality, for data augmentation to enhance generalization and combat overfitting in training deep convolutional neural networks. However, while existing surveys predominantly focus on ML and DL techniques or limited modalities (text or images), a gap remains in addressing the latest advancements and multimodal applications of LLM-based methods. This survey fills that gap by exploring recent literature utilizing multimodal LLMs to augment image, text, and audio data, offering a comprehensive understanding of these processes. We outlined various methods employed in LLM-based image, text, and speech augmentation, and discussed the limitations identified in current approaches. Additionally, we identified…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
MethodsFocus
