# Video Action Transformer Network

**Authors:** Rohit Girdhar, João Carreira, Carl Doersch, Andrew Zisserman

arXiv: 1812.02707 · 2019-05-20

## TL;DR

The paper presents the Action Transformer, a Transformer-style model that recognizes and localizes human actions in video clips by aggregating features from the spatiotemporal context around each person. It outperforms the prior state of the art on the Atomic Visual Actions (AVA) dataset using only raw RGB frames as input.

## Contribution

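It repurposes a Transformer-style architecture for action recognition and localization in video, using high-resolution, person-specific, class-agnostic queries. With no supervision beyond boxes and class labels, the model learns to track individual people and to attend to hands and faces.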

## Key findings

- Outperforms the previous state of the art on the AVA dataset by a significant margin
- Learns to track individual people and to emphasize hands and faces, with no supervision beyond boxes and class labels
- Uses only raw RGB frames as input

## Abstract

We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action - all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin using only raw RGB frames as input.
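The abstract describes a person-specific query that aggregates features from the surrounding spatiotemporal context. The sketch below illustrates that general idea with plain single-head scaled dot-product attention. It is not the authors' implementation: the function names, the toy 2-D features, and the single-head, single-layer form are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    query:  D floats, a hypothetical person-specific feature
    keys:   N vectors of D floats, one per spatiotemporal location
    values: N vectors of Dv floats, one per spatiotemporal location
    Returns (context_vector, attention_weights).
    """
    d = len(query)
    # Similarity between the person query and every context location.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    # Weighted sum of the context values, one output per value dimension.
    dv = len(values[0])
    context = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(dv)
    ]
    return context, weights


# Toy data (assumed): three context locations with 2-D features.
person_query = [1.0, 0.0]
context_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
context_values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]

context, weights = attend(person_query, context_keys, context_values)
```

In this toy setup the location whose key best matches the query receives the largest weight, which is the mechanism that lets attention concentrate on a few regions of the clip.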

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.02707/full.md

## Figures

14 figures with captions in the complete paper: https://tomesphere.com/paper/1812.02707/full.md

## References

52 references — full list in the complete paper: https://tomesphere.com/paper/1812.02707/full.md

---
Source: https://tomesphere.com/paper/1812.02707