With a Little Help from Your Own Past: Prototypical Memory Networks for Image Captioning
Manuele Barraco, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR
This paper introduces a prototypical memory network for image captioning that enhances Transformer models by incorporating semantic information from other samples, leading to improved performance on the COCO dataset.
Contribution
The paper proposes a novel prototypical memory mechanism for attention in image captioning, capturing semantic information from multiple samples to boost Transformer performance.
Findings
Achieved an improvement of 3.7 CIDEr points on the COCO dataset
Enhanced Transformer-based captioning with sample-aware attention
Demonstrated effectiveness of prototype-based memory in vision-language tasks
Abstract
Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics in an image and translating it into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information which can come from the joint observation of other samples. In this paper, we devise a network which can perform attention over activations obtained while processing other training samples, through a prototypical memory model. Our memory models the distribution of past keys and values through the definition of prototype vectors which are both discriminative and compact. Experimentally, we assess the performance of the proposed model on the COCO dataset, in comparison with carefully designed…
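The mechanism described in the abstract can be illustrated with a minimal NumPy sketch: the queries of the current sample attend over its own keys and values together with a set of memory prototypes. This is an assumption-laden illustration, not the authors' implementation. The function name `attention_with_memory`, the array shapes, and the random inputs are hypothetical, and the prototypes are taken as given arrays because this summary does not specify how they are built from past keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_memory(q, k, v, mem_k, mem_v):
    """Scaled dot-product attention over current keys/values plus memory prototypes.

    q: (n_q, d) queries from the current sample
    k: (n_k, d), v: (n_k, d_v) keys/values from the current sample
    mem_k: (m, d), mem_v: (m, d_v) prototype keys/values (assumed to be given)
    """
    # Append the prototypes to the current keys and values.
    keys = np.concatenate([k, mem_k], axis=0)
    values = np.concatenate([v, mem_v], axis=0)
    # Standard scaled dot-product attention over the extended set.
    scores = q @ keys.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

# Hypothetical toy inputs: 4 queries, 5 current keys, 3 memory prototypes.
rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=(4, d))
k = rng.normal(size=(5, d))
v = rng.normal(size=(5, d))
mem_k = rng.normal(size=(3, d))
mem_v = rng.normal(size=(3, d))

out, weights = attention_with_memory(q, k, v, mem_k, mem_v)

# With an empty memory the operator reduces to ordinary attention.
empty_k = np.zeros((0, d))
empty_v = np.zeros((0, d))
plain_out, plain_weights = attention_with_memory(q, k, v, empty_k, empty_v)
```

Each row of `weights` is a distribution over the 5 current positions followed by the 3 prototypes, so the output can mix information from the current image with information summarized from other samples. Passing an empty memory recovers plain attention, which makes the memory an additive extension of the usual operator in this sketch.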
Videos
With a Little Help from Your Own Past: Prototypical Memory Networks for Image Captioning (YouTube)
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dropout · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections
