Token-Level Marginalization for Multi-Label LLM Classifiers
Anjaneya Praharaj, Jaykumar Kasundra

TL;DR
This paper introduces token-level marginalization methods to improve confidence scoring and interpretability of generative language models in multi-label content safety classification, demonstrating significant performance gains.
Contribution
It proposes three novel token-level probability estimation techniques to enhance interpretability and accuracy of LLM classifiers for content safety tasks.
Findings
Token marginalization improves confidence score reliability.
Enhanced interpretability aids fine-grained error analysis.
Framework generalizes across instruction-tuned models.
Abstract
This paper addresses the critical challenge of deriving interpretable confidence scores from generative large language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective for identifying unsafe content and its categories, their generative architecture does not expose class-level probabilities directly, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and to evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated,…
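The abstract does not spell out the three estimation approaches, so the following is only a minimal sketch of the general idea of token-level marginalization, not the authors' method. It assumes we already have next-token log-probabilities at the position where the model emits its label, and that several surface tokens (for example with and without a leading space) map to the same class. The function names, the token map, and the probability values are all hypothetical.

```python
import math

def marginalize_label_probs(token_logprobs, label_token_map):
    """Sum the probability mass of every surface token mapped to each label.

    token_logprobs: dict of token string -> log-probability (hypothetical input).
    label_token_map: dict of label -> list of token strings counted for that label.
    """
    scores = {}
    for label, variants in label_token_map.items():
        scores[label] = sum(
            math.exp(token_logprobs[tok]) for tok in variants if tok in token_logprobs
        )
    return scores

def normalize(scores):
    """Renormalize over the label set so the class scores sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return {label: 0.0 for label in scores}
    return {label: value / total for label, value in scores.items()}

def labels_above(scores, threshold):
    """Return labels whose score meets a caller-chosen threshold."""
    return [label for label, value in scores.items() if value >= threshold]

# Illustrative numbers only; these are not results from the paper.
token_logprobs = {
    "safe": math.log(0.5),
    " safe": math.log(0.1),
    "unsafe": math.log(0.3),
    " unsafe": math.log(0.05),
    "the": math.log(0.05),
}
label_token_map = {
    "safe": ["safe", " safe"],
    "unsafe": ["unsafe", " unsafe"],
}

marginal = marginalize_label_probs(token_logprobs, label_token_map)
confidence = normalize(marginal)
flagged = labels_above(confidence, threshold=0.5)
```

With class-level scores of this kind, a moderation threshold can be tuned per category instead of relying on the single decoded label.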
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Spam and Phishing Detection
