# Integrating Text and Image: Determining Multimodal Document Intent in Instagram Posts

**Authors:** Julia Kruk, Jonah Lubin, Karan Sikka, Xiao Lin, Dan Jurafsky, Ajay Divakaran

arXiv: 1904.09073 · 2019-11-11

## TL;DR

This paper introduces a multimodal dataset of 1299 Instagram posts labeled for authorial intent, contextual relationship, and semiotic relationship, and shows that a classifier using both text and image detects intent better than one using the image alone.

## Contribution

The paper presents a dataset labeled under three taxonomies and a baseline deep multimodal classifier that validates them, showing that intent detection benefits from combining caption and image.

## Key findings

- Using both text and image improves intent detection by 9.6% compared with using only the image modality.
- The gain from multimodality is greatest when the image and caption diverge semiotically.
- Dataset enables studying complex meaning interactions in social media.

## Abstract

Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image. For example, a caption might evoke an ironic contrast with the image, so neither caption nor image is a mere transcript of the other. Instead they combine -- via what has been called meaning multiplication -- to create a new meaning that has a more complex relation to the literal meanings of text and image. Here we introduce a multimodal dataset of 1299 Instagram posts labeled for three orthogonal taxonomies: the authorial intent behind the image-caption pair, the contextual relationship between the literal meanings of the image and caption, and the semiotic relationship between the signified meanings of the image and caption. We build a baseline deep multimodal classifier to validate the taxonomy, showing that employing both text and image improves intent detection by 9.6% compared to using only the image modality, demonstrating the commonality of non-intersective meaning multiplication. The gain with multimodality is greatest when the image and caption diverge semiotically. Our dataset offers a new resource for the study of the rich meanings that result from pairing text and image.
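To make the abstract's setup concrete, the following is a minimal sketch of a post carrying one label per taxonomy and of a classifier that scores intent from both modalities. It is not the authors' model: the `LabeledPost` record, the `fuse` and `classify` helpers, the fusion-by-concatenation choice, the feature vectors, and the placeholder label strings are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LabeledPost:
    """One image-caption pair with one label from each of the three taxonomies.

    Field names and feature representations are hypothetical.
    """
    image_features: List[float]  # assumed precomputed image embedding
    text_features: List[float]   # assumed precomputed caption embedding
    intent: str                  # authorial intent label (placeholder value)
    contextual: str              # relation between literal meanings
    semiotic: str                # relation between signified meanings


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(image_features: Sequence[float], text_features: Sequence[float]) -> List[float]:
    """Combine the two modalities by simple concatenation (an assumed choice)."""
    return list(image_features) + list(text_features)


def classify(features: Sequence[float],
             weights: Sequence[Sequence[float]],
             biases: Sequence[float]) -> List[float]:
    """Apply one linear layer followed by softmax; one weight row per class."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Toy post with made-up features and placeholder labels.
post = LabeledPost(
    image_features=[1.0, 0.0],
    text_features=[0.0, 1.0],
    intent="intent_label_0",
    contextual="contextual_label_0",
    semiotic="semiotic_label_0",
)

# Hand-picked toy weights for two intent classes (not learned).
multimodal_probs = classify(
    fuse(post.image_features, post.text_features),
    weights=[[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
    biases=[0.0, 0.0],
)

# Image-only baseline over the same toy post, using only the image features.
image_only_probs = classify(
    post.image_features,
    weights=[[1.0, 0.0], [0.0, 1.0]],
    biases=[0.0, 0.0],
)
```

Comparing `multimodal_probs` with `image_only_probs` across a labeled set is the kind of measurement that would underlie a multimodal-versus-image-only comparison; the toy weights here carry no empirical meaning.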

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.09073/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/1904.09073/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/1904.09073/full.md

---
Source: https://tomesphere.com/paper/1904.09073