# Referring to Objects in Videos using Spatio-Temporal Identifying Descriptions

**Authors:** Peratham Wiriyathammabhum, Abhinav Shrivastava, Vlad I. Morariu, Larry S. Davis

arXiv: 1904.03885 · 2019-04-09

## TL;DR

This paper introduces the task of grounding spatio-temporal identifying descriptions in videos, together with a grammar-constrained data collection scheme and a two-stream modular attention network that grounds descriptions using appearance and motion.

## Contribution

The paper contributes a data collection scheme based on grammatical constraints for surface realization, and a two-stream modular attention network for grounding spatio-temporal identifying descriptions in videos.

## Key findings

- Motion modules improve grounding of motion-related words.
- Modular networks reduce task interference between appearance and motion modules.
- Replacing ground-truth visual annotations with an automatic video object detector and temporal event localization remains an open challenge that calls for a more robust system.

## Abstract

This paper presents a new task, the grounding of spatio-temporal identifying descriptions in videos. Previous work suggests potential bias in existing datasets and emphasizes the need for a new data creation schema to better model linguistic structure. We introduce a new data collection scheme based on grammatical constraints for surface realization to enable us to investigate the problem of grounding spatio-temporal identifying descriptions in videos. We then propose a two-stream modular attention network that learns and grounds spatio-temporal identifying descriptions based on appearance and motion. We show that motion modules help to ground motion-related words and also help to learn in appearance modules because modular neural networks resolve task interference between modules. Finally, we propose a future challenge and a need for a robust system arising from replacing ground truth visual annotations with an automatic video object detector and temporal event localization.
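To make the idea of a two-stream modular attention network more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each module (appearance, motion) has its own query vector that soft-attends over the words of a description, and that each candidate object is scored by cosine similarity between the attended phrase vector and the candidate's feature for that stream. The names `attend`, `ground`, `q_app`, `q_mot`, and `module_weights`, the shared two-dimensional embedding space, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def attend(words, query):
    """Soft attention over word vectors.

    Returns the attention weights and the weighted sum of word vectors.
    """
    weights = softmax([dot(w, query) for w in words])
    dim = len(words[0])
    pooled = [sum(wt * w[i] for wt, w in zip(weights, words)) for i in range(dim)]
    return weights, pooled


def ground(words, candidates, q_app, q_mot, module_weights=(0.5, 0.5)):
    """Score each candidate with an appearance module and a motion module.

    Assumption: word vectors and visual features live in one shared space,
    which a real system would have to learn.
    """
    _, app_phrase = attend(words, q_app)
    _, mot_phrase = attend(words, q_mot)
    scores = []
    for cand in candidates:
        s_app = cosine(app_phrase, cand["appearance"])
        s_mot = cosine(mot_phrase, cand["motion"])
        scores.append(module_weights[0] * s_app + module_weights[1] * s_mot)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy description "red ... running": axis 0 stands for "red", axis 1 for "running".
words = [[1.0, 0.0], [0.0, 1.0]]
q_app = [5.0, 0.0]  # hypothetical query that attends to appearance words
q_mot = [0.0, 5.0]  # hypothetical query that attends to motion words

# Both candidates look red; only the second one is running.
candidates = [
    {"appearance": [1.0, 0.0], "motion": [1.0, 0.0]},
    {"appearance": [1.0, 0.0], "motion": [0.0, 1.0]},
]

best, scores = ground(words, candidates, q_app, q_mot)
```

In this toy setup the two candidates are identical in appearance, so only the motion module can separate them, which mirrors the abstract's point that motion modules help ground motion-related words. Keeping the two streams as separate modules with separate attention is one simple way to keep appearance and motion cues from competing for the same parameters.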

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.03885/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/1904.03885/full.md

## References

53 references — full list in the complete paper: https://tomesphere.com/paper/1904.03885/full.md

---
Source: https://tomesphere.com/paper/1904.03885