Label Forensics: Interpreting Hard Labels in Black-Box Text Classifier
Mengyao Du, Gang Yang, Han Fang, Quanjun Yin, Ee-chien Chang

TL;DR
This paper introduces a framework called label forensics that infers the semantic meaning of hard-label outputs from black-box text classifiers, aiding interpretability and AI auditing.
Contribution
It proposes a novel method to reconstruct label semantics as a distribution of sentence embeddings, enabling interpretation of undocumented black-box classifiers.
Findings
Achieved around 92.24% label consistency in experiments
Successfully interpreted an undocumented HuggingFace classifier
Demonstrated the framework's effectiveness for AI auditing
Abstract
The widespread adoption of natural language processing techniques has led to an unprecedented growth of text classifiers across the modern web. Yet many of these models circulate with their internal semantics undocumented or even intentionally withheld. Such opaque classifiers, which may expose only hard-label outputs, can operate in unregulated web environments or be repurposed for unknown intents, raising legitimate forensic and auditing concerns. In this paper, we position ourselves as investigators and work to infer the semantic concept each label encodes in an undocumented black-box classifier. Specifically, we introduce label forensics, a black-box framework that reconstructs a label's semantic meaning. Concretely, we represent a label by a sentence embedding distribution from which any sample reliably reflects the concept the classifier has implicitly learned for that label. We…
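The label consistency figure reported under Findings can be illustrated with a small sketch. The excerpt does not give the paper's exact definition, so this assumes consistency means the fraction of sentences sampled for a label that the black-box classifier maps back to that same label. The names `label_consistency`, `toy_classifier`, and the example sentences are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence


def label_consistency(
    classify: Callable[[str], str],
    samples: Sequence[str],
    target_label: str,
) -> float:
    """Fraction of samples that the classifier assigns to target_label.

    Assumed definition for illustration; the paper's metric may differ.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    hits = sum(1 for text in samples if classify(text) == target_label)
    return hits / len(samples)


# Hypothetical stand-in for an undocumented classifier that exposes
# only hard labels (no scores, no label descriptions).
def toy_classifier(text: str) -> str:
    return "LABEL_1" if "free" in text.lower() else "LABEL_0"


# Hypothetical sentences drawn from a reconstructed distribution for LABEL_1.
candidate_samples = [
    "Claim your FREE prize now",
    "Free shipping on all orders",
    "Meeting moved to 3pm",
    "Get a free trial today",
]

score = label_consistency(toy_classifier, candidate_samples, "LABEL_1")
print(score)  # 3 of the 4 samples receive LABEL_1
```

Only the classifier's hard-label output is queried here, which matches the black-box setting the abstract describes. A real audit would replace `toy_classifier` with calls to the model under investigation.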
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies · Spam and Phishing Detection
