Prompting in the Dark: Assessing Human Performance in Prompt Engineering for Data Labeling When Gold Labels Are Absent

Zeyu He; Saniya Naphade; Ting-Hao 'Kenneth' Huang

arXiv:2502.11267 · cs.HC · September 3, 2025

TL;DR

This study examines how well people can improve LLM-based data labeling through iterative prompting when no gold-standard labels are available. It finds the process highly unreliable and shows that automated prompt optimization tools also struggle when gold labels are scarce.

Contribution

It introduces PromptingSheet, a tool for iterative prompt-based data labeling without gold labels, and provides empirical insights into human performance and tool limitations in this setting.

Findings

01

Only 9 of 20 participants improved labeling accuracy after four or more iterations.

02

Automated prompt optimization tools such as DSPy also struggle with few gold labels.

03

The results highlight the critical role of gold labels in effective prompt engineering.

Abstract

Millions of users prompt large language models (LLMs) for various tasks, but how good are people at prompt engineering? Do users actually get closer to their desired outcome over multiple iterations of their prompts? These questions are crucial when no gold-standard labels are available to measure progress. This paper investigates a scenario in LLM-powered data labeling, "prompting in the dark," where users iteratively prompt LLMs to label data without using manually-labeled benchmarks. We developed PromptingSheet, a Google Sheets add-on that enables users to compose, revise, and iteratively label data through spreadsheets. Through a study with 20 participants, we found that prompting in the dark was highly unreliable -- only 9 participants improved labeling accuracy after four or more iterations. Automated prompt optimization tools like DSPy also struggled when few gold labels were…
