Image Captioning with Deep Bidirectional LSTMs
Cheng Wang, Haojin Yang, Christian Bartz, Christoph Meinel

TL;DR
This paper introduces a deep bidirectional LSTM model for image captioning that effectively captures long-term visual-language interactions, achieving competitive results on standard benchmarks without additional mechanisms.
Contribution
The paper proposes novel deep bidirectional LSTM architectures and data augmentation techniques for improved image captioning and retrieval performance.
Findings
Achieves results highly competitive with the state of the art on caption generation tasks.
Significantly outperforms recent methods on image-sentence retrieval.
Demonstrates the effectiveness of bidirectional LSTMs in modeling visual-language relationships.
Abstract
This work presents an end-to-end trainable deep bidirectional LSTM (Long Short-Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks. It is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. Two novel deep bidirectional variant models, in which we increase the depth of nonlinearity transition in different ways, are proposed to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale and vertical mirror are proposed to prevent overfitting in training deep models. We visualize the evolution of bidirectional LSTM internal states over time and qualitatively analyze how our models "translate" image to sentence. Our proposed models are evaluated on caption generation and…
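To make the "history and future context" idea concrete, here is a minimal NumPy sketch of a generic single-layer bidirectional LSTM: one LSTM reads the sequence left to right, a second, separate LSTM reads it right to left, and the two hidden states are concatenated at each time step. This is not the authors' architecture (it omits the CNN image features, the deeper variants, and training). The function names, parameter shapes, and the choice to concatenate the two directions are assumptions made for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng, input_dim, hidden):
    """Hypothetical helper: small random weights for one LSTM direction."""
    W = rng.normal(0.0, 0.1, (4 * hidden, input_dim))  # input-to-hidden
    U = rng.normal(0.0, 0.1, (4 * hidden, hidden))     # hidden-to-hidden
    b = np.zeros(4 * hidden)
    return W, U, b


def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W, U, b stack the input, forget, output, and candidate blocks."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:n])            # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:4 * n])    # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new


def run_lstm(xs, params, hidden):
    """Run one LSTM over a sequence xs of shape (T, input_dim); returns (T, hidden)."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, *params)
        states.append(h)
    return np.stack(states)


def bidirectional_lstm(xs, fwd_params, bwd_params, hidden):
    """Concatenate forward (history) and backward (future) hidden states per step."""
    h_fwd = run_lstm(xs, fwd_params, hidden)
    # The backward LSTM reads the reversed sequence; its outputs are flipped back
    # so that row t lines up with time step t.
    h_bwd = run_lstm(xs[::-1], bwd_params, hidden)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)
```

With a sequence of T word vectors, `bidirectional_lstm` returns an array of shape (T, 2 * hidden). The first `hidden` columns at step t depend only on inputs up to t, while the remaining columns depend only on inputs from t onward.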
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
