New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing
Andreas Madsen

TL;DR
This Ph.D. thesis introduces new paradigms for faithful interpretability in NLP models. It first develops faithfulness metrics, then explores two paradigms, faithfulness measurable models (FMMs) and self-explanations, aimed at making explanations more faithful.
Contribution
It proposes the development of faithfulness measurable models and self-explanations, providing new paradigms that enhance explanation faithfulness in neural NLP models.
Findings
FMMs produce explanations with near-optimal faithfulness.
Post-hoc explanations are model and task-dependent.
Simple model modifications can drastically improve explanation faithfulness.
Abstract
As machine learning becomes more widespread and is used in more critical applications, it's important to provide explanations for these models, to prevent unintended behavior. Unfortunately, many current interpretability methods struggle with faithfulness. Therefore, this Ph.D. thesis investigates the question "How to provide and ensure faithful explanations for complex general-purpose neural NLP models?" The main thesis is that we should develop new paradigms in interpretability. This is achieved by first developing solid faithfulness metrics and then applying the lessons learned from this investigation to develop new paradigms. The two new paradigms explored are faithfulness measurable models (FMMs) and self-explanations. The idea in self-explanations is to have large language models explain themselves; we identify that current models are not capable of doing this consistently.…
Taxonomy
Topics: Natural Language Processing Techniques
