Hybrid Attribution Priors for Explainable and Robust Model Training

Zhuoran Zhang; Feng Zhang; Shangyuan Li; Yang Shi; Yuanxing Zhang; Wei Chen; Tengjiao Wang; Kam-Fai Wong

arXiv:2512.14719·cs.LG·December 18, 2025

Open Access 4 Reviews

TL;DR

This paper introduces a novel attribution prior framework, CAP, and its hybrid version, CAP Hybrid, to improve the interpretability and robustness of small language models by guiding them to focus on fine-grained class distinctions.

Contribution

The paper proposes the Class-Aware Attribution Prior (CAP) and CAP Hybrid, novel methods that enhance model differentiation and robustness by leveraging enriched attribution priors.

Findings

01 · Improves interpretability of language models.

02 · Enhances robustness against adversarial attacks.

03 · Effective across full-data and few-shot scenarios.

Abstract

Small language models (SLMs) are widely used in tasks that require low latency and lightweight deployment, particularly classification. As interpretability and robustness gain increasing importance, explanation-guided learning has emerged as an effective framework by introducing attribution-based supervision during training; however, deriving general and reliable attribution priors remains a significant challenge. Through an analysis of representative attribution methods in classification settings, we find that although these methods can reliably highlight class-relevant tokens, they often focus on common keywords shared by semantically similar classes. Because such classes are already difficult to distinguish under standard training, these attributions provide insufficient discriminative cues, limiting their ability to improve model differentiation. To overcome this limitation, we…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* Originality: The paper highlights homogenization and class confusion in methods such as LIME and SHAP when applied to language model predictions. The efficient Cholesky factorization solution to the attribution optimization problem (Equations 1-4) demonstrates technical innovation and can potentially be adopted to speed up future perturbation-based attribution techniques.
* Quality: The evaluation focuses on the real-world task of intent classification, employing three diverse datasets, althou

Weaknesses

In the related work section, the authors correctly identify "[...] the elicitation of more reliable explanations as a key pathway to enhance interpretative robustness". However, the investigation focuses solely on methods such as SHAP, LIME and IG that were developed long before present-day language models and, importantly, are not specifically attuned to the language domain. Much research was conducted on the development of new methods that would reflect input contributions more faithfully in these domain

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper targets a practical goal: making small language models both accurate and interpretable. The work provides a clear way to get discriminative supervision without manual rationales.

Weaknesses

The fundamental issue is that the paper’s core prior (CAP) is built by repeatedly querying a large language model with masked versions of each training example, then fitting a regression to infer per-word importance. This seems extremely expensive at realistic dataset sizes, especially for long texts. On top of that, CAPHybrid, which is the best performing method, fuses multiple attribution sources (CAP, LIME, IG), some of which are themselves multi-pass methods. There are claims of broad robus
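The procedure this review describes (query a model on masked copies of an input, then regress the outputs on the masks) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_scorer` is a hypothetical stand-in for the LLM query, the function names and the regularization strength `lam` are invented for the example, and the Cholesky solve mirrors only what Reviewer 01 mentions about Equations 1-4.

```python
import numpy as np

def sample_masks(n_tokens, n_samples, keep_prob=0.5, seed=0):
    """Random binary masks: 1 keeps a token, 0 masks it out."""
    rng = np.random.default_rng(seed)
    return (rng.random((n_samples, n_tokens)) < keep_prob).astype(float)

def ridge_attribution(masks, scores, lam=1.0):
    """Solve (Z^T Z + lam * I) w = Z^T y using a Cholesky factorization."""
    Z = np.asarray(masks, dtype=float)
    y = np.asarray(scores, dtype=float)
    A = Z.T @ Z + lam * np.eye(Z.shape[1])   # symmetric positive definite for lam > 0
    b = Z.T @ y
    L = np.linalg.cholesky(A)                # A = L @ L.T
    u = np.linalg.solve(L, b)                # solve L u = b
    return np.linalg.solve(L.T, u)           # solve L.T w = u

def toy_scorer(mask):
    """Hypothetical stand-in for one LLM query on a masked input.

    A real pipeline would return the model's score for the target class;
    here the score is linear in the mask so the result can be checked.
    """
    true_weights = np.array([0.0, 2.0, 0.0, -1.0])
    return float(mask @ true_weights)

masks = sample_masks(n_tokens=4, n_samples=200)
scores = np.array([toy_scorer(m) for m in masks])   # one query per masked copy
weights = ridge_attribution(masks, scores, lam=1e-6)
```

The cost concern raised in the review is visible in the `scores` line: each training example needs `n_samples` model queries before the regression can be fit.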

Reviewer 03 · Rating 2 · Confidence 5

Strengths

- The paper is generally well-written and structured, guiding the reader from problem motivation to method formulation and evaluation.
- The paper provides both quantitative and qualitative analyses, as well as an adversarial evaluation protocol to assess robustness.
- The paper evaluates CAPHybrid under diverse conditions including full-data, few-shot, and adversarial settings.

Weaknesses

- The paper argues that existing attribution methods such as LIME and IG fail to produce class-aware or discriminative explanations, but it does not compare against more advanced interpretability methods that have explicitly addressed this limitation. For instance, transformer-based attribution approaches [1,2,3] provide **stronger baselines** for evaluating “class-aware” interpretability. Without these comparisons, it remains unclear whether CAPHybrid truly advances beyond the state of the art

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1. The proposed LLM-based feature attribution method is novel and offers a promising direction for future research and exploration.
2. The hybrid approach is conceptually sound and empirically validated, demonstrating that LLM-based feature attribution provides complementary benefits to traditional SLM-based methods.

Weaknesses

1. The aggregation mechanism could be further investigated. In the simulation study, the authors did not specify which aggregation method (mean or max) was used. Including an ablation study comparing different aggregation strategies would make the analysis more comprehensive.
2. The formulation of the LLM attribution priors (both in Equation (1) and the subsequent algorithm, lines 254–276) is equivalent to Ridge regression. The authors may consider citing relevant references and leveragi
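To make the mean-versus-max question in the first point concrete, the sketch below fuses per-token scores from several attribution sources in both ways. The excerpt does not say what the paper aggregates or how, so everything here is an assumption for illustration: the `normalize` and `fuse` helpers, the absolute-value normalization, and the example score vectors are all hypothetical.

```python
import numpy as np

def normalize(attr):
    """Scale absolute attribution scores so they sum to 1 (assumed convention)."""
    a = np.abs(np.asarray(attr, dtype=float))
    total = a.sum()
    return a / total if total > 0 else a

def fuse(attributions, how="mean"):
    """Combine several per-token attribution vectors into one prior."""
    stacked = np.stack([normalize(a) for a in attributions])
    if how == "mean":
        fused = stacked.mean(axis=0)   # every source votes equally
    elif how == "max":
        fused = stacked.max(axis=0)    # keep the strongest signal per token
    else:
        raise ValueError(f"unknown aggregation: {how}")
    return normalize(fused)

# Hypothetical scores for a three-token input from two sources.
source_a = [0.1, 0.6, 0.3]
source_b = [0.5, 0.5, 0.0]

prior_mean = fuse([source_a, source_b], how="mean")
prior_max = fuse([source_a, source_b], how="max")
```

In this toy case the two strategies agree on the top token but weight the others differently, which is the kind of difference an ablation would need to measure.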

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications