# Harvesting Information from Captions for Weakly Supervised Semantic Segmentation

**Authors:** Johann Sawatzky, Debayan Banerjee, Juergen Gall

arXiv: 1905.06784 · 2019-05-17

## TL;DR

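This paper introduces a weakly supervised semantic segmentation method that is trained on image captions as they can be found on the Internet instead of curated class tags. A multi-modal network uses the captions to improve the estimated class activation maps, and the method achieves state-of-the-art results for weakly supervised segmentation on COCO.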

## Contribution

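The paper proposes image captions as a form of weak supervision. A multi-modal network learns a joint embedding of image and caption and estimates text activation maps (TAMs) for class names and for compound concepts, i.e. nouns combined with their attributes. These TAMs improve the class activation maps that are then used to train the segmentation network.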

## Key findings

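- Achieves state-of-the-art results for weakly supervised image segmentation on the COCO dataset.
- Uses the textual context that captions provide for the classes present in an image to improve class activation maps.
- Estimates text activation maps for compound concepts (nouns plus attributes), which substantially improve the quality of the class activation maps.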

## Abstract

Since acquiring pixel-wise annotations for training convolutional neural networks for semantic image segmentation is time-consuming, weakly supervised approaches that only require class tags have been proposed. In this work, we propose another form of supervision, namely image captions as they can be found on the Internet. These captions have two advantages. They do not require additional curation, as is the case for the clean class tags used by current weakly supervised approaches, and they provide textual context for the classes present in an image. To leverage such textual context, we deploy a multi-modal network that learns a joint embedding of the visual representation of the image and the textual representation of the caption. The network estimates text activation maps (TAMs) for class names as well as compound concepts, i.e. combinations of nouns and their attributes. The TAMs of compound concepts describing classes of interest substantially improve the quality of the estimated class activation maps, which are then used to train a network for semantic segmentation. We evaluate our method on the COCO dataset, where it achieves state-of-the-art results for weakly supervised image segmentation.
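To make the idea of a text activation map more concrete, here is a minimal toy sketch. It is not the authors' implementation: this summary does not say how TAMs are computed, so the sketch assumes that a TAM is the cosine similarity between each spatial feature vector and a text embedding in the joint space, clipped at zero, and that the TAMs of a class name and a compound concept are merged with an element-wise maximum. The function names (`text_activation_map`, `merge_maps`) and the 2-dimensional embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def text_activation_map(features, text_embedding):
    """Assumed TAM: per-location similarity to the text embedding, clipped at zero.

    features is an H x W grid of D-dimensional vectors in the joint embedding space.
    """
    return [[max(0.0, cosine(f, text_embedding)) for f in row] for row in features]


def merge_maps(maps):
    """Element-wise maximum over a list of equally sized H x W maps (an assumed merge rule)."""
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*maps)]


# Toy 2 x 2 feature grid with 2-dimensional features (made-up numbers).
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
class_embedding = [1.0, 0.0]     # hypothetical embedding of a class name, e.g. "dog"
compound_embedding = [0.0, 1.0]  # hypothetical embedding of a compound concept, e.g. "brown dog"

tam_class = text_activation_map(features, class_embedding)
tam_compound = text_activation_map(features, compound_embedding)
merged = merge_maps([tam_class, tam_compound])
```

In this toy setup the compound-concept map activates a location that the class-name map misses, which is the kind of complementary evidence the abstract attributes to compound concepts. The merged map would then play the role of an improved class activation map for training the segmentation network.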

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.06784/full.md

## Figures

22 figures with captions in the complete paper: https://tomesphere.com/paper/1905.06784/full.md

## References

52 references — full list in the complete paper: https://tomesphere.com/paper/1905.06784/full.md

---
Source: https://tomesphere.com/paper/1905.06784