Are Large Language Models Good Data Preprocessors?

Elyas Meguellati, Nardiena Pratama, Shazia Sadiq, and Gianluca Demartini

arXiv:2502.16790·cs.CL·February 25, 2025

TL;DR

This paper evaluates the effectiveness of various large language models in cleaning and refining noisy image caption data, assessing their impact on downstream multimodal tasks.

Contribution

It provides an empirical comparison of multiple LLMs for data preprocessing, highlighting their potential and limitations in improving noisy textual data for complex tasks.

Findings

01. LLMs can improve data quality but with limited statistical significance

02. Effectiveness varies depending on dataset complexity and noise level

03. Further research needed to optimize LLMs for data cleaning

Abstract

High-quality textual training data is essential for the success of multimodal data processing tasks, yet outputs from image captioning models like BLIP and GIT often contain errors and anomalies that are difficult to rectify using rule-based methods. While recent work addressing this issue has predominantly focused on using GPT models for data preprocessing on relatively simple public datasets, there is a need to explore a broader range of Large Language Models (LLMs) and tackle more challenging and diverse datasets. In this study, we investigate the use of multiple LLMs, including LLaMA 3.1 70B, GPT-4 Turbo, and Sonnet 3.5 v2, to refine and clean the textual outputs of BLIP and GIT. We assess the impact of LLM-assisted data cleaning by comparing downstream-task (SemEval 2024 Subtask "Multilabel Persuasion Detection in Memes") models trained on cleaned versus non-cleaned data. While…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Distributed and Parallel Computing Systems

Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Dense Connections · Attention Dropout · Residual Connection · Discriminative Fine-Tuning · Label Smoothing