B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

Yifan Wang; Sukrut Rao; Ji-Ung Lee; Mayank Jobanputra; Vera Demberg

arXiv:2502.12992 · cs.CL · December 10, 2025

TL;DR

This paper introduces B-cos LMs: pre-trained language models that are made explainable by combining B-cos conversion with task fine-tuning. The resulting models give more faithful and more human-interpretable explanations while keeping task performance comparable to conventional fine-tuning.

Contribution

The work extends B-cos networks to NLP by transforming pre-trained language models into B-cos LMs, enhancing explainability without sacrificing task accuracy.

Findings

01. B-cos LMs produce more faithful explanations than post-hoc methods.

02. B-cos LMs maintain comparable task performance to traditional fine-tuning.

03. Transforming decoder-only models to B-cos LMs is feasible for generation tasks.

Abstract

Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability by proposing an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos Language Models (LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods. Automatic and human evaluation results demonstrate that B-cos…
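The two architectural ideas named in the abstract, removing bias terms and promoting input-weight alignment, can be illustrated with the B-cos transform as it is defined in the earlier B-cos networks work for vision: each unit computes a bias-free linear response with a unit-norm weight and scales it by |cos(x, w)|^(B-1). The following is a minimal sketch under that assumption, not the authors' implementation; the function name `bcos_linear`, its parameters, and the `eps` guard are illustrative, and this excerpt does not specify how B-cos LMs apply the transform inside a transformer.

```python
import math


def bcos_linear(x, weights, b=2.0, eps=1e-12):
    """Illustrative B-cos transform of one input vector (hypothetical helper).

    x:       list of floats, the input vector.
    weights: list of weight vectors, one per output unit.
    b:       the exponent B; b = 1 reduces to a bias-free linear layer
             with unit-norm weights.
    Each output is (w_hat . x) * |cos(x, w_hat)| ** (b - 1),
    where w_hat = w / ||w||. There is no bias term.
    """
    x_norm = max(math.sqrt(sum(v * v for v in x)), eps)
    outputs = []
    for w in weights:
        w_norm = max(math.sqrt(sum(v * v for v in w)), eps)
        # With a unit-norm weight the linear response is ||x|| * cos(x, w).
        linear = sum(wi * xi for wi, xi in zip(w, x)) / w_norm
        cos = linear / x_norm
        # Poorly aligned inputs are down-weighted by |cos| ** (b - 1).
        outputs.append(linear * abs(cos) ** (b - 1.0))
    return outputs


if __name__ == "__main__":
    x = [3.0, 4.0]
    # One aligned, one partially aligned, one orthogonal weight vector.
    print(bcos_linear(x, [[3.0, 4.0], [1.0, 0.0], [-4.0, 3.0]]))
```

With B > 1, a unit responds strongly only when its weight vector points in a similar direction to the input, which is the alignment pressure the abstract refers to. Because there is no bias, a zero input always maps to a zero output.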

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)

Methods: High-Order Consensuses