When Better Eyes Lead to Blindness: A Diagnostic Study of the Information Bottleneck in CNN-LSTM Image Captioning Models
Hitesh Kumar Gupta

TL;DR
This paper systematically develops a series of CNN-LSTM image captioning models on MS COCO and shows that, within this paradigm, upgrading the visual backbone without adding attention can degrade performance. The final attention-based model, Nexus, reaches a competitive BLEU-4 of 31.4.
Contribution
The paper documents the iterative development of five CNN-LSTM models, from Genesis to Nexus, and shows that an attention mechanism matters more than a backbone upgrade for image captioning.
Findings
Upgrading the visual backbone alone can degrade performance.
Attention mechanisms are essential for transmitting visual detail.
The final model, Nexus, achieves a BLEU-4 of 31.4 on MS COCO.
Abstract
Image captioning, situated at the intersection of computer vision and natural language processing, requires a sophisticated understanding of both visual scenes and linguistic structure. While modern approaches are dominated by large-scale Transformer architectures, this paper documents a systematic, iterative development of foundational image captioning models, progressing from a simple CNN-LSTM encoder-decoder to a competitive attention-based system. This paper presents a series of five models, beginning with Genesis and concluding with Nexus, an advanced model featuring an EfficientNetV2B3 backbone and a dynamic attention mechanism. The experiments chart the impact of architectural enhancements and demonstrate a key finding within the classic CNN-LSTM paradigm: merely upgrading the visual backbone without a corresponding attention mechanism can degrade performance, as the…
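To make the bottleneck idea concrete, the following is a minimal NumPy sketch, not the paper's implementation. It contrasts a single pooled image vector (what a plain CNN-LSTM decoder would receive) with a context vector computed by soft attention over a grid of CNN features. This excerpt does not specify the form of attention used in Nexus, so additive (Bahdanau-style) attention is assumed here. The function names, weight matrices (`W_f`, `W_h`, `v`), and all dimensions are hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pooled_context(features):
    """Bottleneck baseline: collapse (num_regions, feat_dim) into one vector."""
    return features.mean(axis=0)

def additive_attention(features, hidden, W_f, W_h, v):
    """Soft attention over spatial features (assumed form, not from the paper).

    features: (R, D) CNN feature grid flattened to R regions
    hidden:   (H,)   current decoder (LSTM) hidden state
    W_f: (D, A), W_h: (H, A), v: (A,)  hypothetical learned parameters
    """
    # One score per region, conditioned on the decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (R,)
    weights = softmax(scores)                             # (R,), sums to 1
    context = weights @ features                          # (D,)
    return context, weights

# Illustrative shapes only: a 7x7 feature grid with 8 channels.
rng = np.random.default_rng(0)
R, D, H, A = 49, 8, 16, 12
features = rng.normal(size=(R, D))
hidden = rng.normal(size=H)
W_f = rng.normal(size=(D, A))
W_h = rng.normal(size=(H, A))
v = rng.normal(size=A)

context, weights = additive_attention(features, hidden, W_f, W_h, v)
```

With uniform attention weights the context vector equals the pooled vector, so the pooled baseline is a special case in which the decoder cannot choose which regions to look at. With attention, the weights are recomputed from the decoder state at every step, which is one way a richer backbone's spatial detail could reach the language model.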
