DoLFIn: Distributions over Latent Features for Interpretability
Phong Le, Willem Zuidema

TL;DR
DoLFIn introduces a novel interpretability method for neural networks that uses distributions over latent features, enabling straightforward explanations and slightly improved performance in text classification tasks.
Contribution
The paper presents DoLFIn, a new architecture that models features as an unordered set with associated probabilities, enhancing interpretability without sacrificing model performance.
Findings
DoLFIn provides clear probability-based explanations for model decisions.
It slightly outperforms classical CNN and BiLSTM models on SST2 and AG-news datasets.
The approach maintains interpretability while achieving competitive accuracy.
Abstract
Interpreting the inner workings of neural models is a key step in ensuring the robustness and trustworthiness of the models, but work on neural network interpretability typically faces a trade-off: either the models are too constrained to be very useful, or the solutions found by the models are too complex to interpret. We propose a novel strategy for achieving interpretability that -- in our experiments -- avoids this trade-off. Our approach builds on the success of using probability as the central quantity, such as for instance within the attention mechanism. In our architecture, DoLFIn (Distributions over Latent Features for Interpretability), we do not determine beforehand what each feature represents, and features go altogether into an unordered set. Each feature has an associated probability ranging from 0 to 1, weighing its importance for further processing. We show that, unlike…
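The idea of an unordered feature set with per-feature probabilities can be illustrated with a small sketch. This is not the authors' implementation: the abstract excerpt does not give the actual parameterization, so the function name `weighted_feature_set`, the weights `W_feat` and `w_prob`, the one-feature-per-token setup, and the sum pooling are all assumptions made for illustration. It only shows how a probability in [0, 1] can weigh each latent feature, and why the result does not depend on feature order.

```python
import numpy as np


def sigmoid(x):
    """Map real values to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def weighted_feature_set(token_vectors, W_feat, w_prob):
    """Hypothetical sketch: score latent features with probabilities.

    token_vectors: (n_tokens, d_in) input representations
    W_feat:        (d_in, d_feat) assumed projection to latent features
    w_prob:        (d_in,) assumed weights producing one score per feature
    """
    # One candidate latent feature per token (an assumption of this sketch).
    features = np.tanh(token_vectors @ W_feat)        # (n_tokens, d_feat)
    # Each feature gets its own probability in [0, 1]; unlike softmax
    # attention, these values are not forced to sum to 1.
    probs = sigmoid(token_vectors @ w_prob)           # (n_tokens,)
    # Summation treats the features as an unordered set:
    # permuting the rows leaves the pooled vector unchanged.
    pooled = (probs[:, None] * features).sum(axis=0)  # (d_feat,)
    return features, probs, pooled


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
W_feat = rng.normal(size=(8, 4))
w_prob = rng.normal(size=8)

features, probs, pooled = weighted_feature_set(tokens, W_feat, w_prob)
# The most probable features are the ones a reader would inspect first.
ranking = np.argsort(-probs)
```

In this sketch, `ranking` orders the features by probability, which is the kind of probability-based explanation the summary refers to; how DoLFIn actually derives and uses these probabilities is described in the paper itself.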
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling
Methods: Interpretability · Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Bidirectional LSTM
