# Zero-Shot Crowd Behavior Recognition

**Authors:** Xun Xu, Shaogang Gong, Timothy Hospedales

arXiv: 1908.05877 · 2019-08-19

## TL;DR

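This paper introduces a zero-shot learning approach that recognizes multiple crowd behaviors per video without any training examples of the target behaviors. It predicts attribute cooccurrence from a text corpus and from multilabel annotations of videos with known attributes, then uses that cooccurrence as context to improve detection of unseen behaviors.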

## Contribution

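The paper proposes a multiattribute cooccurrence model for zero-shot crowd behavior recognition. Because cooccurrence statistics do not exist for unseen attributes, the model learns to predict them, which makes joint multilabel prediction possible without training samples for those attributes.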

## Key findings

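- Modeling multiattribute context improves zero-shot crowd behavior recognition on the WWW crowd video dataset.
- The approach generalizes cross-domain to novel behavior (violence) detection on the Violence Flow video dataset.
- Cross-attribute cooccurrence is predicted from both an online text corpus and multilabel annotations of videos with known attributes.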
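
To make the notion of cooccurrence among known attributes concrete, here is a minimal sketch that estimates conditional cooccurrence from binary multilabel annotations. The attribute names, the toy annotation matrix, and the function name `cooccurrence` are all hypothetical, and the conditional-frequency estimate is only one plausible choice, not necessarily the statistic used in the paper.

```python
from typing import List


def cooccurrence(labels: List[List[int]]) -> List[List[float]]:
    """Estimate P(attribute j present | attribute i present) from binary rows."""
    n_attr = len(labels[0])
    counts = [[0] * n_attr for _ in range(n_attr)]
    for row in labels:
        present = [k for k, value in enumerate(row) if value]
        for i in present:
            for j in present:
                counts[i][j] += 1
    cond = []
    for i in range(n_attr):
        total = counts[i][i]  # number of videos that contain attribute i
        cond.append([counts[i][j] / total if total else 0.0 for j in range(n_attr)])
    return cond


# Hypothetical toy annotations; columns are ("outdoor", "walking", "fighting").
videos = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
matrix = cooccurrence(videos)
print(matrix[0])  # cooccurrence of each attribute given "outdoor"
```

Such a matrix can only be computed for attributes that have annotations, which is why unseen attributes need a separate source of cooccurrence information.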

## Abstract

Understanding crowd behavior in video is challenging for computer vision. There have been increasing attempts at modeling crowded scenes by introducing ever larger property ontologies (attributes) and annotating ever larger training datasets. However, in contrast to still images, manually annotating video attributes requires considering spatiotemporal evolution, which is inherently much harder and more costly. Critically, the most interesting crowd behaviors captured in surveillance videos (e.g., street fighting, flash mobs) are either rare, and thus have few examples for model training, or previously unseen. Existing crowd analysis techniques are not readily scalable to recognizing novel (unseen) crowd behaviors. To address this problem, we investigate and develop methods for recognizing visual crowd behavioral attributes without any training samples, i.e., zero-shot learning for crowd behavior recognition. To that end, we relax the common assumption that each individual crowd video instance is associated with only a single crowd attribute. Instead, our model learns to jointly recognize multiple crowd behavioral attributes in each video instance by exploring multiattribute cooccurrence as contextual knowledge for optimizing individual crowd attribute recognition. Joint multilabel attribute prediction in zero-shot learning is inherently nontrivial because cooccurrence statistics do not exist for unseen attributes. To solve this problem, we learn to predict cross-attribute cooccurrence from both an online text corpus and multilabel annotations of videos with known attributes. Our experiments show that this approach to modeling multiattribute context not only improves zero-shot crowd behavior recognition on the WWW crowd video dataset, but also generalizes to novel behavior (violence) detection cross-domain in the Violence Flow video dataset.
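
The idea of predicting cooccurrence for an unseen attribute from text can be illustrated with a deliberately simplified sketch. It assumes each attribute name has a word vector, fits a one-dimensional least-squares line from cosine similarity to observed cooccurrence on seen attribute pairs, and applies that line to pairs involving an unseen attribute. The vectors, cooccurrence values, and helper names (`cosine`, `fit_line`, `predict_cooc`) are hypothetical, and the linear fit is a stand-in for illustration rather than the predictor described in the paper.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var if var else 0.0
    return slope, mean_y - slope * mean_x


# Hypothetical 2-D word vectors; "violence" plays the role of the unseen attribute.
vectors: Dict[str, List[float]] = {
    "walking": [1.0, 0.0],
    "outdoor": [0.8, 0.6],
    "fighting": [0.0, 1.0],
    "violence": [0.1, 0.9],
}

# Hypothetical cooccurrence values observed for seen attribute pairs.
seen_cooc: Dict[Tuple[str, str], float] = {
    ("walking", "outdoor"): 0.70,
    ("walking", "fighting"): 0.10,
    ("outdoor", "fighting"): 0.55,
}

xs = [cosine(vectors[a], vectors[b]) for a, b in seen_cooc]
ys = list(seen_cooc.values())
slope, intercept = fit_line(xs, ys)


def predict_cooc(a: str, b: str) -> float:
    """Predict cooccurrence from text similarity, clipped to [0, 1]."""
    raw = slope * cosine(vectors[a], vectors[b]) + intercept
    return min(1.0, max(0.0, raw))


predicted = {seen: predict_cooc("violence", seen)
             for seen in ("walking", "outdoor", "fighting")}
print(predicted)
```

In this toy setup the unseen attribute is predicted to cooccur most strongly with the seen attribute whose word vector is closest to it, which is the kind of contextual prior a joint multilabel predictor could then exploit.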

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1908.05877/full.md

## Figures

22 figures with captions in the complete paper: https://tomesphere.com/paper/1908.05877/full.md

## References

58 references — full list in the complete paper: https://tomesphere.com/paper/1908.05877/full.md

---
Source: https://tomesphere.com/paper/1908.05877