Token-Level Marginalization for Multi-Label LLM Classifiers

Anjaneya Praharaj; Jaykumar Kasundra

arXiv:2511.22312·cs.CL·December 1, 2025

Open Access · 1 dataset

TL;DR

This paper introduces token-level marginalization methods that improve the confidence scoring and interpretability of generative language models in multi-label content safety classification, and it reports significant performance gains.

Contribution

The paper proposes three novel token-level probability estimation techniques that make LLM classifiers for content safety tasks more interpretable and more accurate.

Findings

01 Token marginalization improves confidence score reliability.

02 Enhanced interpretability aids fine-grained error analysis.

03 Framework generalizes across instruction-tuned models.

Abstract

This paper addresses the critical challenge of deriving interpretable confidence scores from generative large language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective at identifying unsafe content and its categories, their generative architecture does not directly expose class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and to evaluate how well the framework generalizes across different instruction-tuned models. Through extensive experimentation on a synthetically generated,…
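The abstract excerpt does not spell out the three estimation approaches, so the following is only a minimal Python sketch of the general idea of token-level marginalization, not the authors' method. It assumes the model exposes top-k log-probabilities at the position where the verdict token is generated; the function name `marginalize_label_probs`, the `token_to_label` mapping, and the example numbers are all hypothetical.

```python
import math
from collections import defaultdict


def marginalize_label_probs(token_logprobs, token_to_label):
    """Sum the probability mass of all candidate tokens that map to the same label.

    token_logprobs: dict of candidate token string -> log-probability (assumed input).
    token_to_label: dict of normalized token string -> class label (hypothetical mapping).
    Returns label probabilities renormalized over the tokens that map to some label.
    """
    mass = defaultdict(float)
    for token, logprob in token_logprobs.items():
        # Surface variants such as " Unsafe" and "unsafe" count toward the same label.
        label = token_to_label.get(token.strip().lower())
        if label is not None:
            mass[label] += math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        return {}
    return {label: p / total for label, p in mass.items()}


# Made-up top-k log-probabilities at the verdict position, for illustration only.
example_logprobs = {
    "safe": math.log(0.55),
    " safe": math.log(0.10),
    "unsafe": math.log(0.25),
    " Unsafe": math.log(0.05),
    "The": math.log(0.05),  # maps to no label, so it is ignored
}
token_to_label = {"safe": "safe", "unsafe": "unsafe"}

scores = marginalize_label_probs(example_logprobs, token_to_label)
```

A score produced this way can be compared against a moderation threshold instead of relying on the generated text alone. In a multi-label setting the same summation could be repeated at each position where a category token is emitted, which is again an assumption rather than something the excerpt confirms.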

Code & Models

Datasets

AuroraQuantum/llama-guard-safety-eval
dataset · 44 downloads

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Spam and Phishing Detection