# An Analysis of Action Recognition Datasets for Language and Vision Tasks

**Authors:** Spandana Gella, Frank Keller

arXiv: 1704.07129 · 2017-04-25

## TL;DR

This paper surveys action recognition datasets that integrate language and vision, analyzing their diversity, strengths, and weaknesses to inform future research in multimodal understanding.

## Contribution

The paper categorizes existing approaches by how they conceptualize action recognition and reviews in detail the datasets that link visual actions with linguistic resources, highlighting their diversity, advantages, and limitations.

## Key findings

- Datasets vary in their conceptualization of action recognition.
- Recent datasets offer fine-grained syntactic and semantic analysis.
- Diversity in datasets supports a range of applications in language-vision tasks.

## Abstract

A large amount of recent research has focused on tasks that combine language and vision, resulting in a proliferation of datasets and methods. One such task is action recognition, whose applications include image annotation, scene understanding and image retrieval. In this survey, we categorize the existing approaches based on how they conceptualize this problem and provide a detailed review of existing datasets, highlighting their diversity as well as advantages and disadvantages. We focus on recently developed datasets which link visual information with linguistic resources and provide a fine-grained syntactic and semantic analysis of actions in images.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1704.07129/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/1704.07129/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/1704.07129/full.md

---
Source: https://tomesphere.com/paper/1704.07129