Reading Is Believing: Revisiting Language Bottleneck Models for Image Classification
Honori Udo, Takafumi Koshinaka

TL;DR
This paper shows that a modern language bottleneck model, which pairs a recent image captioner with a pre-trained language model, can exceed black-box models in image classification accuracy on a disaster image classification task, and that fusing the language bottleneck model with a black-box model raises accuracy further.
Contribution
It shows that recent large-scale vision-language models can overcome previous accuracy limitations of language bottleneck approaches in image classification.
Findings
In a disaster image classification task, a language bottleneck model can surpass black-box models in accuracy.
Fusing language bottleneck and black-box models yields higher accuracy.
Modern captioners enable effective explainability without sacrificing performance.
Abstract
We revisit language bottleneck models as an approach to ensuring the explainability of deep learning models for image classification. Because of inevitable information loss incurred in the step of converting images into language, the accuracy of language bottleneck models is considered to be inferior to that of standard black-box models. Recent image captioners based on large-scale foundation models of Vision and Language, however, have the ability to accurately describe images in verbal detail to a degree that was previously believed to not be realistically possible. In a task of disaster image classification, we experimentally show that a language bottleneck model that combines a modern image captioner with a pre-trained language model can achieve image classification accuracy that exceeds that of black-box models. We also demonstrate that a language bottleneck model and a black-box…
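The pipeline the abstract describes (image to caption, caption to class, then optional fusion with a black-box model) can be sketched as below. This is only a minimal illustration under stated assumptions, not the authors' implementation: the class labels, the toy captioner, the keyword-based text classifier, and the weighted averaging of class probabilities are all hypothetical stand-ins, since the excerpt does not specify the models or the fusion method used.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical class labels; the paper's actual label set is not given in this excerpt.
CLASSES = ["flood", "fire", "no_damage"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_bottleneck_predict(
    image: str,
    captioner: Callable[[str], str],
    text_classifier: Callable[[str], Sequence[float]],
) -> Tuple[str, List[float]]:
    """Classify an image using only its caption (the language bottleneck).

    The caption is returned too, because it is the human-readable
    explanation of what the classifier was allowed to see.
    """
    caption = captioner(image)
    return caption, softmax(text_classifier(caption))


def late_fusion(
    p_bottleneck: Sequence[float],
    p_blackbox: Sequence[float],
    weight: float = 0.5,
) -> List[float]:
    """Weighted average of two probability vectors (an assumed fusion rule)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_bottleneck, p_blackbox)]


# --- Toy stand-ins so the sketch runs without any model downloads ---

def toy_captioner(image_id: str) -> str:
    # A real system would call an image captioning model here.
    captions = {
        "img_001": "A street submerged in muddy water with cars half under water."
    }
    return captions[image_id]


def toy_text_classifier(caption: str) -> List[float]:
    # A real system would use a fine-tuned pre-trained language model here.
    keywords = {
        "flood": ["water", "submerged"],
        "fire": ["smoke", "flames"],
        "no_damage": ["intact", "clear"],
    }
    text = caption.lower()
    return [float(sum(text.count(k) for k in keywords[c])) for c in CLASSES]


caption, p_lb = language_bottleneck_predict("img_001", toy_captioner, toy_text_classifier)
p_bb = [0.6, 0.3, 0.1]  # made-up black-box output for the same image
p_fused = late_fusion(p_lb, p_bb, weight=0.5)
predicted = CLASSES[p_fused.index(max(p_fused))]
```

The point of the design is that the text classifier never sees pixels: whatever it predicts can be traced back to the caption string, while the fusion step trades some of that transparency for the extra signal a black-box image model may carry.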
Taxonomy
Topics: Natural Language Processing Techniques
