Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP

Wei Du, Laksh Advani, Yashmeet Gambhir, Daniel J Perry, Prashant Shiralkar, Zhengzheng Xing, and Aaron Colak

arXiv:2309.05619·cs.CL·November 21, 2023

Open Access

TL;DR

This paper shows that ensemble disagreement scores can effectively estimate LLM performance in industrial NLP tasks, reducing the need for costly human labeling and outperforming other proxy methods across multiple languages and domains.

Contribution

It introduces ensemble disagreement scores as a reliable proxy for human labeling in assessing LLM performance in real-world industrial NLP applications.

Findings

01

Disagreement scores closely match the model error measured against human-labeled ground truth, with MAE as low as 0.4%.

02

Disagreement scores estimate model performance better than silver labels (machine labels from another LLM), by an average of 13.8%.

03

The approach is effective across multiple languages and domains.

Abstract

Large language models (LLMs) have demonstrated significant capability to generalize across a large number of NLP tasks. For industry applications, it is imperative to assess the performance of the LLM on unlabeled production data from time to time to validate for a real-world setting. Human labeling to assess model error requires considerable expense and time delay. Here we demonstrate that ensemble disagreement scores work well as a proxy for human labeling for language models in zero-shot, few-shot, and fine-tuned settings, per our evaluation on the keyphrase extraction (KPE) task. We measure fidelity of the results by comparing to true error measured from human labeled ground truth. We contrast with the alternative of using another LLM as a source of machine labels, or silver labels. Results across various languages and domains show disagreement scores provide a better estimation of…
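
As a rough illustration of the idea in the abstract, the sketch below scores how much an ensemble of keyphrase extractors disagrees with a primary model, then compares such estimates with human-measured error using MAE. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: disagreement is taken as 1 minus the set-level F1 between the primary model and each other ensemble member, and the function names (`f1_between`, `disagreement_score`, `mean_absolute_error`) and the toy data are hypothetical, not the authors' implementation.

```python
from __future__ import annotations


def f1_between(pred_a: set[str], pred_b: set[str]) -> float:
    """Set-level F1 between two keyphrase predictions for one document."""
    if not pred_a and not pred_b:
        return 1.0  # two empty predictions agree completely
    overlap = len(pred_a & pred_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_a)
    recall = overlap / len(pred_b)
    return 2 * precision * recall / (precision + recall)


def disagreement_score(
    primary: list[set[str]], others: list[list[set[str]]]
) -> float:
    """Average (1 - F1) between the primary model and each other member.

    primary: one keyphrase set per document from the model being assessed.
    others:  for each other ensemble member, one keyphrase set per document.
    ASSUMPTION: 1 - F1 as the disagreement measure is illustrative only.
    """
    per_document = []
    for i, primary_pred in enumerate(primary):
        scores = [1.0 - f1_between(primary_pred, member[i]) for member in others]
        per_document.append(sum(scores) / len(scores))
    return sum(per_document) / len(per_document)


def mean_absolute_error(estimates: list[float], truths: list[float]) -> float:
    """MAE between estimated error rates and human-measured error rates."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(estimates)


# Toy data (hypothetical): two documents, one primary model, two other members.
primary_preds = [{"battery life", "screen"}, {"refund"}]
other_preds = [
    [{"battery life", "screen"}, {"refund"}],  # member 1 agrees fully
    [{"battery life"}, {"shipping"}],          # member 2 partly disagrees
]
estimated_error = disagreement_score(primary_preds, other_preds)
print(f"estimated error from disagreement: {estimated_error:.3f}")
```

In this sketch no human labels are needed to compute `estimated_error`; labels enter only when checking the proxy's fidelity, for example by calling `mean_absolute_error` on disagreement-based estimates and true error rates collected over several datasets.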

Taxonomy

Topics: Advanced Text Analysis Techniques · Topic Modeling · Natural Language Processing Techniques