Label Forensics: Interpreting Hard Labels in Black-Box Text Classifier

Mengyao Du; Gang Yang; Han Fang; Quanjun Yin; Ee-chien Chang

arXiv:2512.01514·cs.LG·December 2, 2025

TL;DR

This paper introduces a framework called label forensics that infers the semantic meaning of hard-label outputs from black-box text classifiers, aiding interpretability and AI auditing.

Contribution

It proposes a novel method to reconstruct label semantics as a distribution of sentence embeddings, enabling interpretation of undocumented black-box classifiers.

Findings

01. Achieved around 92.24% label consistency in experiments.
02. Successfully interpreted an undocumented HuggingFace classifier.
03. Demonstrated the framework's effectiveness for AI auditing.
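For intuition only, the headline figure can be read through one plausible definition of label consistency, which is an assumption here rather than the paper's stated formula: the fraction of N sentences sampled for a target label y that the black-box hard-label classifier f actually assigns to y. The counts in the example are hypothetical and are not taken from the paper.

```latex
\mathrm{LC}(y) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[ f(x_i) = y \right],
\qquad
\text{e.g. } N = 5000,\ \sum_{i} \mathbf{1}\left[ f(x_i) = y \right] = 4612
\;\Rightarrow\; \mathrm{LC}(y) = \frac{4612}{5000} = 0.9224 = 92.24\%
```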

Abstract

The widespread adoption of natural language processing techniques has led to an unprecedented growth of text classifiers across the modern web. Yet many of these models circulate with their internal semantics undocumented or even intentionally withheld. Such opaque classifiers, which may expose only hard-label outputs, can operate in unregulated web environments or be repurposed for unknown intents, raising legitimate forensic and auditing concerns. In this paper, we position ourselves as investigators and work to infer the semantic concept each label encodes in an undocumented black-box classifier. Specifically, we introduce label forensics, a black-box framework that reconstructs a label's semantic meaning. Concretely, we represent a label by a sentence embedding distribution from which any sample reliably reflects the concept the classifier has implicitly learned for that label. We…
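The idea of representing a label by an embedding distribution, and checking that samples from it are reliably assigned that label, can be illustrated with a small toy. This is a minimal sketch under stated assumptions, not the authors' method: the distribution is modeled as a handful of seed embeddings with an isotropic Gaussian neighborhood (a stand-in for the paper's own sampler, which the truncated abstract does not describe), and the black box is a hypothetical hidden linear head that accepts embeddings directly and returns only a hard label. A real classifier consumes text, so mapping sampled embeddings back to sentences is out of scope here. The names `sample_label_embeddings`, `label_consistency`, and `toy_black_box` are all hypothetical.

```python
import random
from typing import Callable, List, Sequence

Embedding = List[float]


def sample_label_embeddings(
    seeds: Sequence[Embedding], sigma: float, n: int, rng: random.Random
) -> List[Embedding]:
    """Draw n embeddings from a hypothetical seed-plus-Gaussian-neighborhood distribution.

    Each sample picks one seed embedding uniformly at random and adds
    independent Gaussian noise (std = sigma) to every coordinate.
    """
    samples = []
    for _ in range(n):
        seed = rng.choice(seeds)
        samples.append([x + rng.gauss(0.0, sigma) for x in seed])
    return samples


def label_consistency(
    samples: Sequence[Embedding],
    target_label: str,
    black_box: Callable[[Embedding], str],
) -> float:
    """Fraction of samples to which the hard-label black box assigns target_label."""
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(1 for s in samples if black_box(s) == target_label)
    return hits / len(samples)


def toy_black_box(e: Embedding) -> str:
    """Hypothetical black box: a hidden linear head over 2-D embeddings.

    Only the hard label is exposed; no scores or probabilities are returned.
    """
    return "spam" if 1.5 * e[0] - 0.5 * e[1] > 0 else "ham"


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical seed embeddings standing in for selected seed sentences.
    seeds = [[2.0, 0.5], [1.5, -0.5]]
    samples = sample_label_embeddings(seeds, sigma=0.4, n=1000, rng=rng)
    score = label_consistency(samples, "spam", toy_black_box)
    print(f"label consistency for 'spam': {score:.4f}")
```

The trade-off the sketch exposes is between coverage and consistency: a larger `sigma` spreads samples over more of the embedding space but pushes more of them across the classifier's decision boundary, lowering consistency, while seeds far from the boundary tolerate a wider neighborhood.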

Taxonomy

Topics: Topic Modeling · Text and Document Classification Technologies · Spam and Phishing Detection