# Image Captioning with Clause-Focused Metrics in a Multi-Modal Setting for Marketing

**Authors:** Philipp Harzig, Dan Zecha, Rainer Lienhart, Carolin Kaiser, René Schallner

arXiv: 1905.01919 · 2019-08-07

## TL;DR

This paper introduces a multi-task neural network approach for image captioning in marketing, emphasizing clause-focused evaluation metrics that capture semantic accuracy related to human-product interactions.

## Contribution

It proposes a novel clause-focused metric and a multi-task learning framework to generate more semantically accurate image captions in a marketing context.
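To make the idea of a clause-focused metric concrete, here is a minimal sketch that scores generated captions slot by slot (subject, predicate, object, product name) against reference captions. It is an illustration under stated assumptions, not the paper's implementation: the function name `slot_accuracy`, the dictionary representation of a caption, and the case-insensitive exact-match rule are all hypothetical, and the step that extracts the four slots from raw caption text is assumed to have happened already.

```python
# Hypothetical slot names; the abstract lists these four criteria.
SLOTS = ("subject", "predicate", "object", "product")


def slot_accuracy(predicted, reference):
    """Return, per slot, the fraction of captions whose slot matches the reference.

    Both arguments are lists of dicts mapping slot name -> string.
    Matching is case-insensitive exact match (an assumption of this sketch).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return {slot: 0.0 for slot in SLOTS}

    hits = {slot: 0 for slot in SLOTS}
    for pred, ref in zip(predicted, reference):
        for slot in SLOTS:
            value = pred.get(slot)
            # A missing slot in the prediction counts as a miss.
            if value is not None and value.lower() == ref[slot].lower():
                hits[slot] += 1
    return {slot: hits[slot] / len(reference) for slot in SLOTS}


# Toy data with made-up brand names.
predicted = [
    {"subject": "woman", "predicate": "drinks", "object": "bottle", "product": "BrandA"},
    {"subject": "man", "predicate": "holds", "object": "can", "product": "BrandB"},
]
reference = [
    {"subject": "woman", "predicate": "holds", "object": "bottle", "product": "branda"},
    {"subject": "man", "predicate": "holds", "object": "can", "product": "BrandC"},
]
scores = slot_accuracy(predicted, reference)
```

Reporting one score per slot, rather than a single sentence-similarity number, shows which part of the clause a model gets wrong, for example a correct subject and object paired with the wrong product name.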

## Key findings

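- Jointly predicting integer-valued human-product interaction ratings in a multi-task setting considerably improves caption quality
- A novel clause-focused metric checks whether generated captions contain the required subject, predicate, object, and product name
- Soft targets are used to address annotator disagreement on the image ratings
- The clause-focused metrics are also applicable to other captioning datasets such as MSCOCO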

## Abstract

Automatically generating descriptive captions for images is a well-researched area in computer vision. However, existing evaluation approaches focus on measuring the similarity between two sentences disregarding fine-grained semantics of the captions. In our setting of images depicting persons interacting with branded products, the subject, predicate, object and the name of the branded product are important evaluation criteria of the generated captions. Generating image captions with these constraints is a new challenge, which we tackle in this work. By simultaneously predicting integer-valued ratings that describe attributes of the human-product interaction, we optimize a deep neural network architecture in a multi-task learning setting, which considerably improves the caption quality. Furthermore, we introduce a novel metric that allows us to assess whether the generated captions meet our requirements (i.e., subject, predicate, object, and product name) and describe a series of experiments on caption quality and how to address annotator disagreements for the image ratings with an approach called soft targets. We also show that our novel clause-focused metrics are also applicable to other image captioning datasets, such as the popular MSCOCO dataset.
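The abstract mentions soft targets as a way to handle annotator disagreement on the image ratings. One common reading of that term, sketched below, is to replace a single hard label with the empirical distribution of the annotators' ratings and to train against it with cross-entropy. This is an assumption about the general technique, not a description of the paper's exact formulation; the function names, the 0-based rating scale, and the five rating classes are hypothetical.

```python
import math
from collections import Counter


def soft_target(ratings, num_classes):
    """Convert annotators' integer ratings (0 .. num_classes - 1) into a distribution."""
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(ratings)
    return [counts.get(k, 0) / len(ratings) for k in range(num_classes)]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted class probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


# Three annotators disagree: two chose rating 3, one chose rating 4.
target = soft_target([3, 3, 4], num_classes=5)

# Loss of a model that predicts a uniform distribution over the five ratings.
uniform = [0.2] * 5
loss = cross_entropy(target, uniform)
```

With a hard label, the minority rating would be discarded. With the soft target, a prediction that puts some probability on rating 4 is penalized less than one that puts it on a rating no annotator chose.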

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.01919/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1905.01919/full.md

## References

14 references — full list in the complete paper: https://tomesphere.com/paper/1905.01919/full.md

---
Source: https://tomesphere.com/paper/1905.01919