Image Captioning as Neural Machine Translation Task in SOCKEYE
Loris Bazzani, Tobias Domhan, Felix Hieber

TL;DR
This paper applies decoders and attention mechanisms from neural machine translation (attentional recurrent neural networks, self-attentional transformers, and fully-convolutional networks) to image captioning, and implements them in the SOCKEYE toolkit.
Contribution
The paper brings neural machine translation decoders and attention models to image captioning within the SOCKEYE toolkit, connecting techniques from natural language processing and computer vision.
Findings
Different decoders and attention models are evaluated for image captioning.
The models from neural machine translation improve captioning quality.
The implementation is available in the SOCKEYE toolkit.
Abstract
Image captioning is an interdisciplinary research problem that stands between computer vision and natural language processing. The task is to generate a textual description of the content of an image. The typical model used for image captioning is an encoder-decoder deep network, where the encoder captures the essence of an image while the decoder is responsible for generating a sentence describing the image. Attention mechanisms can be used to automatically focus the decoder on parts of the image which are relevant to predict the next word. In this paper, we explore different decoders and attentional models popular in neural machine translation, namely attentional recurrent neural networks, self-attentional transformers, and fully-convolutional networks, which represent the current state of the art of neural machine translation. The image captioning module is available as part of…
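As a rough illustration of the attention step described in the abstract, the sketch below weights a set of image region features by their relevance to the current decoder state and returns a single context vector for predicting the next word. It is a minimal example under stated assumptions, not the SOCKEYE implementation: the function name attend, the scaled dot-product scoring, and the toy feature values are all hypothetical choices made for this illustration.

```python
import math


def attend(regions, decoder_state):
    """Soft attention over image regions (illustrative sketch only).

    regions: list of feature vectors, one per image region (assumed layout).
    decoder_state: current decoder hidden state, same dimension as a region.
    Returns (context, weights).
    """
    dim = len(decoder_state)
    # Relevance score of each region: scaled dot product with the decoder state.
    scores = [
        sum(r * s for r, s in zip(region, decoder_state)) / math.sqrt(dim)
        for region in regions
    ]
    # Numerically stable softmax turns scores into weights that sum to 1.
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted average of the region features.
    context = [
        sum(w * region[j] for w, region in zip(weights, regions))
        for j in range(dim)
    ]
    return context, weights


# Toy input: three regions with 2-dimensional features (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
context, weights = attend(regions, decoder_state)
```

In this toy input, the first and third regions align with the decoder state and so receive more weight than the second. In a full captioning model the decoder would recompute these weights at every output step, so the focus can shift across the image as the sentence is generated.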
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
