Does Prompt Design Impact Quality of Data Imputation by LLMs?
Shreenidhi Srinivasan, Lydia Manikonda

TL;DR
This paper investigates how prompt design influences the quality of data imputation using large language models, proposing a novel token-aware method that improves imputation for class-imbalanced datasets while reducing prompt size.
Contribution
It introduces a structured prompting technique and demonstrates its effectiveness in enhancing LLM-based data imputation for imbalanced datasets, addressing a key gap in the field.
Findings
Significantly reduces input prompt size
Maintains or improves imputation quality
Effective for small, class-imbalanced datasets
Abstract
Generating realistic synthetic tabular data presents a critical challenge in machine learning. Class imbalance in the data adds another layer of complexity. This paper presents a novel token-aware data imputation method that leverages the in-context learning capabilities of large language models. This is achieved by combining a structured group-wise CSV-style prompting technique with the elimination of irrelevant contextual information from the input prompt. We test this approach with two class-imbalanced binary classification datasets and evaluate the effectiveness of imputation using classification-based evaluation metrics. The experimental results demonstrate that our approach significantly reduces the input prompt size while maintaining or improving imputation quality compared to our baseline prompt, especially for datasets that are of relatively…
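The abstract does not reproduce the actual prompt template, so the following is only a minimal sketch of what a group-wise CSV-style prompt might look like next to a more verbose sentence-style prompt. The function names (`build_groupwise_prompt`, `build_verbose_prompt`), the `?` placeholder for missing values, the instruction wording, and the toy records are all assumptions made for illustration, not the authors' method. Character count is used as a crude stand-in for token count.

```python
import csv
import io
from collections import defaultdict


def rows_to_csv(header, rows):
    """Serialize a header and rows as compact CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue().strip()


def build_groupwise_prompt(records, label_key, missing="?"):
    """Hypothetical group-wise prompt: one CSV block per class label.

    The label is stated once per group rather than once per row, and no
    per-row descriptive text is included.
    """
    columns = [k for k in records[0] if k != label_key]
    groups = defaultdict(list)
    for rec in records:
        groups[rec[label_key]].append(
            [missing if rec[c] is None else rec[c] for c in columns]
        )
    parts = ["Fill each '?' with a plausible value. Return CSV only."]
    for label in sorted(groups):
        parts.append(f"{label_key}={label}")
        parts.append(rows_to_csv(columns, groups[label]))
    return "\n".join(parts)


def build_verbose_prompt(records, missing="?"):
    """Hypothetical baseline: every record spelled out as a sentence."""
    lines = ["Fill each '?' with a plausible value."]
    for i, rec in enumerate(records, 1):
        fields = ", ".join(
            f"the {k} is {missing if v is None else v}" for k, v in rec.items()
        )
        lines.append(f"Record {i}: {fields}.")
    return "\n".join(lines)


# Toy data invented for this sketch; None marks a value to impute.
records = [
    {"age": 34, "income": 52000, "label": 0},
    {"age": None, "income": 61000, "label": 0},
    {"age": 45, "income": None, "label": 1},
]

compact = build_groupwise_prompt(records, "label")
verbose = build_verbose_prompt(records)
print(compact)
print(len(compact), "characters (group-wise) vs", len(verbose), "(verbose)")
```

A real comparison would count tokens with the target model's tokenizer rather than characters, and would evaluate the imputed rows with a downstream classifier, as the abstract describes.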
Taxonomy
Topics: Imbalanced Data Classification Techniques · Machine Learning and Algorithms · Machine Learning and Data Classification
