Event Recognition with Automatic Album Detection based on Sequential Processing, Neural Attention and Image Captioning
Andrey V. Savchenko

TL;DR
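This paper introduces a two-stage event recognition method for photo galleries in which albums are unknown: sequential photos are grouped into albums using per-photo classifier scores, the features of each group are aggregated with neural attention, and image captioning is optionally used to improve per-image classification. The method is more accurate than single-photo event recognition and hierarchical clustering.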
Contribution
The paper proposes an approach that combines sequential clustering, neural attention, and image captioning for event recognition in photo galleries where album boundaries are unknown. It outperforms single-photo recognition and hierarchical clustering baselines.
Findings
Achieves 9-20% higher accuracy than single-photo event recognition.
Reduces error rate by 13-16% compared to hierarchical clustering.
Captions generated by an image captioning model trained on Conceptual Captions improve classification accuracy.
Abstract
In this paper, a new formulation of the event recognition task is examined: event categories must be predicted in a gallery of images for which albums (groups of photos corresponding to a single event) are unknown. We propose a novel two-stage approach. First, features are extracted from each photo using a pre-trained convolutional neural network (CNN). These features are classified individually. The classifier scores are used to group sequential photos into several clusters. Finally, the features of the photos in each group are aggregated into a single descriptor using a neural attention mechanism. This algorithm is optionally extended to improve the accuracy of classifying each image in an album. In contrast to conventional fine-tuning of CNNs, we propose to use image captioning, i.e., a generative model that converts images to textual…
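The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not say how classifier scores are compared or how attention weights are computed, so everything specific here is an assumption made for illustration: the Euclidean distance between neighboring score vectors, the `threshold` value, and the query vector `q` (which stands in for a trained attention network). The function names `group_sequential` and `attention_pool` are hypothetical and do not come from the paper.

```python
import numpy as np

def group_sequential(scores, threshold=0.5):
    """Split time-ordered per-photo class scores into contiguous groups.

    Assumption: a new group starts when the Euclidean distance between
    the score vectors of two neighboring photos exceeds `threshold`.
    """
    scores = np.asarray(scores, dtype=float)
    if len(scores) == 0:
        return []
    groups = [[0]]
    for i in range(1, len(scores)):
        dist = np.linalg.norm(scores[i] - scores[i - 1])
        if dist > threshold:
            groups.append([i])      # score profile changed: start a new album
        else:
            groups[-1].append(i)    # similar to previous photo: same album
    return groups

def attention_pool(features, q):
    """Aggregate the features of one group into a single descriptor.

    Assumption: attention weights are a softmax over dot products with a
    query vector `q`; in practice `q` would be learned.
    """
    features = np.asarray(features, dtype=float)
    logits = features @ q
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ features

# Toy gallery: four photos, two event classes, three-dimensional features.
scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
features = np.arange(12, dtype=float).reshape(4, 3)
q = np.zeros(3)  # zero query gives uniform weights, i.e. plain averaging

albums = group_sequential(scores)
descriptors = [attention_pool(features[idx], q) for idx in albums]
```

Because only neighboring photos are compared, each group is a contiguous run in gallery order, which is what separates this kind of sequential grouping from clustering that ignores photo order. Each descriptor would then be passed to an event classifier.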
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
