# Show, Translate and Tell

**Authors:** Dheeraj Peri, Shagan Sah, Raymond Ptucha

arXiv: 1903.06275 · 2019-03-18

## TL;DR

This paper introduces a unified deep learning model that trains jointly on images and captions and learns to generate new captions from either an image or a caption query. The model is evaluated on cross-modal retrieval, image captioning, and sentence paraphrasing, and is reported to be competitive with state-of-the-art methods on retrieval.

## Contribution

A unified model that trains jointly on images and captions and generates new captions from either an image or a caption query, so that one model covers cross-modal retrieval, image captioning, and sentence paraphrasing.

## Key findings

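- The model is competitive with state-of-the-art methods on cross-modal retrieval.
- It generates captions from either an image query or a caption query, which covers both image captioning and sentence paraphrasing.
- It generalizes across the three evaluated tasks and gives insight into cross-modal vector embeddings.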

## Abstract

Humans have an incredible ability to process and understand information from multiple sources such as images, video, text, and speech. The recent success of deep neural networks has enabled us to develop algorithms which give machines the ability to understand and interpret this information. There is a need to both broaden their applicability and develop methods which correlate visual information with semantic content. We propose a unified model which jointly trains on images and captions, and learns to generate new captions given either an image or a caption query. We evaluate our model on three different tasks, namely cross-modal retrieval, image captioning, and sentence paraphrasing. Our model gains insight into cross-modal vector embeddings, generalizes well on multiple tasks, and is competitive with state-of-the-art methods on retrieval.
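
To make the cross-modal retrieval task concrete, here is a minimal sketch of how retrieval over a shared embedding space is commonly done: every image and caption is mapped to a vector, and candidates are ranked by cosine similarity to the query. This is an illustration under stated assumptions, not the paper's implementation. The use of cosine similarity, the function names `cosine_similarity` and `rank_candidates`, and the toy three-dimensional vectors are all hypothetical; the summary does not say which similarity measure or embedding size the authors use.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_candidates(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a trained model would produce these vectors.
image_embeddings = {
    "dog_photo": [0.9, 0.1, 0.0],
    "beach_photo": [0.1, 0.8, 0.3],
    "city_photo": [0.0, 0.2, 0.9],
}
# Stand-in for an encoded caption query such as "a dog in a park".
caption_query = [0.8, 0.2, 0.1]

ranking = rank_candidates(caption_query, image_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The same ranking function works in the other direction (an image vector as the query, caption vectors as candidates), which is what makes a shared embedding space convenient for cross-modal retrieval.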

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.06275/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1903.06275/full.md

## References

21 references — full list in the complete paper: https://tomesphere.com/paper/1903.06275/full.md

---
Source: https://tomesphere.com/paper/1903.06275