Are Large Language Models Good Data Preprocessors?
Elyas Meguellati, Nardiena Pratama, Shazia Sadiq, and Gianluca Demartini

TL;DR
This paper evaluates the effectiveness of various large language models in cleaning and refining noisy image caption data, assessing their impact on downstream multimodal tasks.
Contribution
It provides an empirical comparison of multiple LLMs for data preprocessing, highlighting their potential and limitations in improving noisy textual data for complex tasks.
Findings
LLMs can improve data quality but with limited statistical significance
Effectiveness varies depending on dataset complexity and noise level
Further research needed to optimize LLMs for data cleaning
Abstract
High-quality textual training data is essential for the success of multimodal data processing tasks, yet outputs from image captioning models like BLIP and GIT often contain errors and anomalies that are difficult to rectify using rule-based methods. While recent work addressing this issue has predominantly focused on using GPT models for data preprocessing on relatively simple public datasets, there is a need to explore a broader range of Large Language Models (LLMs) and tackle more challenging and diverse datasets. In this study, we investigate the use of multiple LLMs, including LLaMA 3.1 70B, GPT-4 Turbo, and Sonnet 3.5 v2, to refine and clean the textual outputs of BLIP and GIT. We assess the impact of LLM-assisted data cleaning by comparing downstream-task (SemEval 2024 Subtask "Multilabel Persuasion Detection in Memes") models trained on cleaned versus non-cleaned data. While…
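The cleaning step described in the abstract can be pictured with a minimal sketch. This is an illustration only, not the authors' pipeline: the prompt wording, the function names (`build_prompt`, `clean_captions`, `dummy_llm`), and the noisy caption are all assumptions made up for this example. The `llm` argument is any callable that maps a prompt string to a reply string; a trivial stand-in is used here so the code runs without access to a real model.

```python
from typing import Callable, List

# Hypothetical prompt wording; the paper's actual prompts are not given here.
PROMPT_TEMPLATE = (
    "The following image caption was generated automatically and may contain "
    "repeated words, garbled tokens, or other errors. Rewrite it as one clean "
    "sentence without adding new details.\n\n"
    "Caption: {caption}\nCleaned caption:"
)


def build_prompt(caption: str) -> str:
    """Wrap a raw captioner output in the cleaning instruction."""
    return PROMPT_TEMPLATE.format(caption=caption.strip())


def clean_captions(captions: List[str], llm: Callable[[str], str]) -> List[str]:
    """Send each caption through `llm`, keeping the original on an empty reply."""
    cleaned = []
    for caption in captions:
        reply = llm(build_prompt(caption)).strip()
        cleaned.append(reply if reply else caption)
    return cleaned


def dummy_llm(prompt: str) -> str:
    """Stand-in for a real model call: collapses immediately repeated words."""
    caption = prompt.split("Caption: ", 1)[1].rsplit("\nCleaned caption:", 1)[0]
    words = caption.split()
    deduped = [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
    return " ".join(deduped)


if __name__ == "__main__":
    noisy = ["a man man holding a a sign sign"]  # made-up noisy caption
    print(clean_captions(noisy, dummy_llm))
```

In a real setup, `dummy_llm` would be replaced by a call to whichever model is being evaluated, and the cleaned and non-cleaned captions would each be used to train the downstream classifier so the two can be compared.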
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Distributed and Parallel Computing Systems
Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Dense Connections · Attention Dropout · Residual Connection · Discriminative Fine-Tuning · Label Smoothing
