DeViL: Decoding Vision features into Language
Meghal Dani, Isabel Rio-Torto, Stephan Alaniz, Zeynep Akata

TL;DR
DeViL decodes vision features into natural language descriptions at different network layers, providing interpretable, localized explanations of vision models using a transformer-based approach that generalizes across vision backbones.
Contribution
Introduces DeViL, a method that translates vision features into language for layer-wise interpretability, leveraging a transformer and pre-trained language model for fast, open-vocabulary explanations.
Findings
Outperforms previous captioning models on CC3M.
Generates relevant textual descriptions for vision features.
Achieves state-of-the-art neuron-wise explanations on MILANNOTATIONS.
Abstract
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks. In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned. Our DeViL method decodes vision features into language, not only highlighting the attribution locations but also generating textual descriptions of visual features at different layers of the network. We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language. By employing dropout both per-layer and per-spatial-location, our model can generalize training on image-text pairs to generate localized explanations. As it uses a pre-trained language model, our approach is fast to train, can be applied to any vision…
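The abstract mentions dropout applied both per layer and per spatial location, but this excerpt does not say how it is implemented. The following is only a minimal sketch of one way such a scheme could look, in plain Python. The function name `drop_layers_and_locations`, the dictionary layout, the default probabilities, and the fallback that keeps one location are all assumptions made for illustration and are not taken from the paper.

```python
import random


def drop_layers_and_locations(features, p_layer=0.5, p_location=0.5, rng=None):
    """Randomly drop whole layers and individual spatial locations.

    features: dict mapping a layer name to a list of per-location
    feature vectors (hypothetical layout, chosen for illustration).
    Returns a new dict with the same keys; a dropped layer maps to [].
    Assumes at least one layer contains at least one location.
    """
    rng = rng or random.Random()
    kept = {}
    for layer, locations in features.items():
        # Per-layer dropout: discard every location of this layer.
        if rng.random() < p_layer:
            kept[layer] = []
            continue
        # Per-location dropout: discard individual spatial positions.
        kept[layer] = [vec for vec in locations if rng.random() >= p_location]

    # Assumed fallback: never hand the translator an empty input.
    if not any(kept.values()):
        non_empty = [name for name, locs in features.items() if locs]
        layer = rng.choice(non_empty)
        kept[layer] = [rng.choice(features[layer])]
    return kept


# Toy input: two layers with 2-dimensional features at a few locations.
toy_features = {
    "layer3": [[0.1, 0.2], [0.3, 0.4]],
    "layer4": [[0.5, 0.6]],
}
subset = drop_layers_and_locations(toy_features, rng=random.Random(0))
```

Under this reading, training on random subsets would expose the translation network to inputs ranging from many layers at once down to a single location of a single layer, which is the kind of input a localized explanation would require at inference time.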
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Dropout
