New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing

Andreas Madsen

arXiv:2411.17992·cs.CL·November 28, 2024

TL;DR

This Ph.D. thesis introduces new paradigms for faithful interpretability in NLP models. It first develops faithfulness metrics, then explores two paradigms built on them: faithfulness measurable models (FMMs) and self-explanations.

Contribution

It proposes the development of faithfulness measurable models and self-explanations, providing new paradigms that enhance explanation faithfulness in neural NLP models.

Findings

01  FMMs produce explanations with near-optimal faithfulness.

02  The faithfulness of post-hoc explanations depends on the model and the task.

03  Simple model modifications can drastically improve explanation faithfulness.

Abstract

As machine learning becomes more widespread and is used in more critical applications, it's important to provide explanations for these models, to prevent unintended behavior. Unfortunately, many current interpretability methods struggle with faithfulness. Therefore, this Ph.D. thesis investigates the question "How to provide and ensure faithful explanations for complex general-purpose neural NLP models?" The main thesis is that we should develop new paradigms in interpretability. This is achieved by first developing solid faithfulness metrics and then applying the lessons learned from this investigation to develop new paradigms. The two new paradigms explored are faithfulness measurable models (FMMs) and self-explanations. The idea in self-explanations is to have large language models explain themselves; we identify that current models are not capable of doing this consistently.…

Taxonomy

Topics: Natural Language Processing Techniques