# Recurrent Models for Situation Recognition

**Authors:** Arun Mallya, Svetlana Lazebnik

arXiv: 1703.06233 · 2017-08-07

## TL;DR

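This paper introduces RNN-based models for predicting structured image situations: an action together with the noun entities that fill its semantic roles. The models achieve state-of-the-art accuracy on the imSitu dataset, and the features they learn transfer to image captioning, where they help describe human-object interactions more accurately.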

## Contribution

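The paper presents an RNN approach to situation recognition: a specialized action prediction network followed by an RNN that predicts the nouns filling the action's semantic roles. The approach outperforms previous CRF-based models, including ones trained with additional data, and its learned features transfer effectively to image captioning.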
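As the abstract below states, the system uses a specialized action prediction network followed by an RNN for noun prediction. The following is a minimal, untrained sketch of that two-stage control flow only, not the authors' implementation. Everything in it is an assumption made for illustration: the toy vocabulary, the layer sizes, the random weights, the plain tanh recurrent cell, the way the RNN is conditioned on the verb and the previous noun, and the names `SituationSketch`, `VERB_ROLES`, and `predict`.

```python
import math
import random

# Hypothetical toy vocabulary; imSitu's real verb, role, and noun sets are far larger.
VERB_ROLES = {
    "riding": ["agent", "vehicle", "place"],
    "eating": ["agent", "food", "place"],
}
VERBS = list(VERB_ROLES)
NOUNS = ["person", "horse", "bicycle", "pizza", "field", "kitchen"]
FEAT, HID = 4, 5  # assumed toy sizes for the image feature vector and RNN state


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


class SituationSketch:
    """Untrained two-stage predictor: verb classifier, then an RNN over roles."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.w_verb = rand_matrix(len(VERBS), FEAT, rng)  # stage 1: action scores
        # Assumed RNN input: image features + one-hot verb + one-hot previous noun.
        in_dim = FEAT + len(VERBS) + len(NOUNS)
        self.w_in = rand_matrix(HID, in_dim, rng)
        self.w_rec = rand_matrix(HID, HID, rng)
        self.w_out = rand_matrix(len(NOUNS), HID, rng)

    def predict(self, image_feat):
        # Stage 1: choose the action from the image features alone.
        verb_idx = argmax(matvec(self.w_verb, image_feat))
        verb = VERBS[verb_idx]
        verb_onehot = [1.0 if i == verb_idx else 0.0 for i in range(len(VERBS))]

        # Stage 2: one recurrent step per semantic role of the predicted verb.
        hidden = [0.0] * HID
        prev_noun = [0.0] * len(NOUNS)
        frame = {}
        for role in VERB_ROLES[verb]:
            x = list(image_feat) + verb_onehot + prev_noun
            pre = [a + b for a, b in zip(matvec(self.w_in, x),
                                         matvec(self.w_rec, hidden))]
            hidden = [math.tanh(p) for p in pre]
            noun_idx = argmax(matvec(self.w_out, hidden))
            frame[role] = NOUNS[noun_idx]
            prev_noun = [1.0 if i == noun_idx else 0.0 for i in range(len(NOUNS))]
        return verb, frame


model = SituationSketch(seed=0)
verb, frame = model.predict([0.2, -0.5, 0.9, 0.1])
print(verb, frame)
```

Because the weights are random, the predicted nouns are meaningless; the point is the structure. The verb is fixed first, and that choice determines which roles the recurrent stage steps through, with each noun prediction able to depend on the nouns chosen before it.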

## Key findings

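- The system achieves state-of-the-art accuracy on the imSitu dataset.
- The RNN models outperform CRF-based approaches, including CRF models trained with additional data.
- Features learned from situation prediction transfer to image captioning and yield more accurate descriptions of human-object interactions.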

## Abstract

This work proposes Recurrent Neural Network (RNN) models to predict structured 'image situations' -- actions and noun entities fulfilling semantic roles related to the action. In contrast to prior work relying on Conditional Random Fields (CRFs), we use a specialized action prediction network followed by an RNN for noun prediction. Our system obtains state-of-the-art accuracy on the challenging recent imSitu dataset, beating CRF-based models, including ones trained with additional data. Further, we show that specialized features learned from situation prediction can be transferred to the task of image captioning to more accurately describe human-object interactions.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.06233/full.md

## Figures

20 figures with captions in the complete paper: https://tomesphere.com/paper/1703.06233/full.md

## References

35 references, with the full list in the complete paper: https://tomesphere.com/paper/1703.06233/full.md

---
Source: https://tomesphere.com/paper/1703.06233