# Multi-Agent Reinforcement Learning Based Frame Sampling for Effective   Untrimmed Video Recognition

**Authors:** Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, Shilei Wen

arXiv: 1907.13369 · 2019-08-05

## TL;DR

This paper introduces a multi-agent reinforcement learning approach for adaptive frame sampling in untrimmed video recognition, significantly improving accuracy and efficiency over traditional hand-crafted methods.

## Contribution

It develops a novel MARL framework with context-aware observation and policy networks for dynamic frame selection in untrimmed videos, outperforming existing strategies.

## Key findings

- Outperforms hand-crafted sampling strategies across various benchmarks.
- Achieves state-of-the-art results on YouTube Birds and YouTube Cars datasets.
- Comparable performance to top multi-modal fusion methods on ActivityNet v1.3.

## Abstract

Video Recognition has drawn great research interest and great progress has been made. A suitable frame sampling strategy can improve the accuracy and efficiency of recognition. However, mainstream solutions generally adopt hand-crafted frame sampling strategies for recognition. It could degrade the performance, especially in untrimmed videos, due to the variation of frame-level saliency. To this end, we concentrate on improving untrimmed video classification via developing a learning-based frame sampling strategy. We intuitively formulate the frame sampling procedure as multiple parallel Markov decision processes, each of which aims at picking out a frame/clip by gradually adjusting an initial sampling. Then we propose to solve the problems with multi-agent reinforcement learning (MARL). Our MARL framework is composed of a novel RNN-based context-aware observation network which jointly models context information among nearby agents and historical states of a specific agent, a policy network which generates the probability distribution over a predefined action space at each step and a classification network for reward calculation as well as final recognition. Extensive experimental results show that our MARL-based scheme remarkably outperforms hand-crafted strategies with various 2D and 3D baseline methods. Our single RGB model achieves a comparable performance of ActivityNet v1.3 champion submission with multi-modal multi-model fusion and new state-of-the-art results on YouTube Birds and YouTube Cars.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.13369/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1907.13369/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/1907.13369/full.md

---
Source: https://tomesphere.com/paper/1907.13369