Robust Explanations for User Trust in Enterprise NLP Systems
Guilin Zhang, Kai Zhao, Jeffrey Friedman, Xu Chu, Amine Anoun, Jerry Ting

TL;DR
This paper introduces a framework for evaluating the robustness of token-level explanations in enterprise NLP, revealing that decoder LLMs offer more stable explanations than encoder models, with stability improving as models scale.
Contribution
The paper proposes a unified black-box robustness evaluation protocol and systematically compares multiple models and datasets, highlighting the robustness advantage of decoder LLMs over encoder models.
Findings
Decoder LLMs produce 73% lower flip rates than encoder models.
Explanation stability improves by 44% when scaling from 7B to 70B parameters.
A cost-robustness tradeoff curve guides model and explanation selection.
Abstract
Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access), where representation-based explainers are infeasible. Existing studies also provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness as the top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder…
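To make the protocol in the abstract concrete, here is a minimal Python sketch of leave-one-out occlusion and a top-token flip rate. It is an illustration under stated assumptions, not the authors' implementation. The names `loo_attributions`, `top_position`, `perturb`, `flip_rate`, and `toy_score` are hypothetical. The scorer stands in for a black-box model call. A "flip" is assumed to mean that the position of the highest-attributed token changes after perturbation. Only a length-preserving adjacent-character swap is shown, so positions stay comparable. Deletion, shuffling, and back-translation would need a token alignment step that is not covered here.

```python
import random
from typing import Callable, List

# A black-box scorer: takes tokens, returns a scalar (e.g. class probability).
Scorer = Callable[[List[str]], float]

def loo_attributions(tokens: List[str], score: Scorer) -> List[float]:
    """Leave-one-out occlusion: importance of token i is the score drop when it is removed."""
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

def top_position(tokens: List[str], score: Scorer) -> int:
    """Index of the highest-attributed token (ties resolve to the earliest index)."""
    attrs = loo_attributions(tokens, score)
    return max(range(len(tokens)), key=attrs.__getitem__)

def swap_chars(token: str, rng: random.Random) -> str:
    """Swap one random pair of adjacent characters (a simple typo model)."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]

def perturb(tokens: List[str], severity: float, rng: random.Random) -> List[str]:
    """Corrupt each token independently with probability `severity` (assumed severity model)."""
    return [swap_chars(t, rng) if rng.random() < severity else t for t in tokens]

def flip_rate(texts: List[List[str]], score: Scorer, severity: float, seed: int = 0) -> float:
    """Fraction of inputs whose top-attributed position changes under perturbation."""
    rng = random.Random(seed)
    flips = sum(
        top_position(tokens, score) != top_position(perturb(tokens, severity, rng), score)
        for tokens in texts
    )
    return flips / len(texts)

# Hypothetical stand-in for a model: a keyword-weight sentiment score.
def toy_score(tokens: List[str]) -> float:
    weights = {"great": 2.0, "good": 1.0}
    return sum(weights.get(t, 0.0) for t in tokens)

texts = [["the", "movie", "was", "great"]]
rate_clean = flip_rate(texts, toy_score, severity=0.0)
rate_noisy = flip_rate(texts, toy_score, severity=1.0)
```

In an API-only setting, `toy_score` would be replaced by a function that sends the occluded text to the deployed model and returns the probability of the predicted class. Each input then costs one model call per token plus one for the baseline, which is presumably where the cost side of the cost-robustness tradeoff enters.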
