# Unsupervised Discovery of Multimodal Links in Multi-image,   Multi-sentence Documents

**Authors:** Jack Hessel, Lillian Lee, David Mimno

arXiv: 1904.07826 · 2019-09-04

## TL;DR

This paper introduces algorithms that automatically discover links between images and sentences within documents without needing explicit annotations, using a structured training approach across diverse datasets.

## Contribution

The work presents a novel unsupervised method for identifying image-sentence relationships in multimodal documents, applicable across various real-world datasets.

## Key findings

- Structured training based on co-occurrence effectively predicts image-sentence links.
- Method performs well across datasets with different characteristics.
- Unsupervised approach reduces reliance on annotated data.

## Abstract

Images and text co-occur constantly on the web, but explicit links between images and sentences (or other intra-document textual units) are often not present. We present algorithms that discover image-sentence relationships without relying on explicit multimodal annotation in training. We experiment on seven datasets of varying difficulty, ranging from documents consisting of groups of images captioned post hoc by crowdworkers to naturally-occurring user-generated multimodal documents. We find that a structured training objective based on identifying whether collections of images and sentences co-occur in documents can suffice to predict links between specific sentences and specific images within the same document at test time.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.07826/full.md

## Figures

18 figures with captions in the complete paper: https://tomesphere.com/paper/1904.07826/full.md

## References

63 references — full list in the complete paper: https://tomesphere.com/paper/1904.07826/full.md

---
Source: https://tomesphere.com/paper/1904.07826