A Multi-scale Multiple Instance Video Description Network

Huijuan Xu; Subhashini Venugopalan; Vasili Ramanishka; Marcus Rohrbach; Kate Saenko

arXiv:1505.05914·cs.CV·March 22, 2016·46 cites

TL;DR

This paper introduces a novel multi-scale, multi-instance neural network architecture that enhances video description generation by effectively capturing multiple objects of varying sizes and locations within videos.

Contribution

The paper presents what its authors describe as the first end-to-end trainable multi-scale, multiple instance network, integrated with a sequence-to-sequence model for video captioning.

Findings

1. Outperforms single-scale CNN models on the YouTube video dataset

2. Efficiently handles multiple objects and scales in video frames

3. Demonstrates potential for extension to other video processing tasks

Abstract

Generating natural language descriptions for in-the-wild videos is a challenging task. Most state-of-the-art methods for solving this problem borrow existing deep convolutional neural network (CNN) architectures (AlexNet, GoogLeNet) to extract a visual representation of the input video. However, these deep CNN architectures are designed for single-label centered-positioned object classification. While they generate strong semantic features, they have no inherent structure allowing them to detect multiple objects of different sizes and locations in the frame. Our paper tries to solve this problem by integrating the base CNN into several fully convolutional neural networks (FCNs) to form a multi-scale network that handles multiple receptive field sizes in the original image. FCNs, previously applied to image segmentation, can generate class heat-maps efficiently compared to sliding window…
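
The multi-scale heat-map idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that a linear classifier applied at every spatial location (equivalent to a 1x1 convolution) stands in for an FCN head, that 2x2 average pooling of the feature map stands in for a larger receptive field, and that max pooling over locations and scales plays the role of multiple instance aggregation. The function names `class_heatmap`, `avg_pool2`, and `multiscale_mil_scores` are hypothetical.

```python
import numpy as np


def class_heatmap(features, weights, bias):
    """Apply a linear classifier at every location (a 1x1 convolution).

    features: (H, W, D) feature map; weights: (D, C); bias: (C,).
    Returns an (H, W, C) map of per-location class scores.
    """
    return features @ weights + bias


def avg_pool2(features):
    """Halve spatial resolution with 2x2 average pooling (odd edges are cropped)."""
    h, w, d = features.shape
    h2, w2 = h // 2, w // 2
    cropped = features[: h2 * 2, : w2 * 2]
    return cropped.reshape(h2, 2, w2, 2, d).mean(axis=(1, 3))


def multiscale_mil_scores(features, weights, bias, num_scales=3):
    """Assumed aggregation: max over all locations and scales, one score per class."""
    per_scale = []
    current = features
    for _ in range(num_scales):
        heat = class_heatmap(current, weights, bias)
        per_scale.append(heat.max(axis=(0, 1)))  # best location at this scale
        current = avg_pool2(current)             # coarser map, larger receptive field
    return np.max(per_scale, axis=0)             # best scale per class


# Usage with random stand-in features (8x8 grid, 16 channels, 5 classes).
rng = np.random.default_rng(0)
demo_features = rng.standard_normal((8, 8, 16))
demo_weights = rng.standard_normal((16, 5))
demo_scores = multiscale_mil_scores(demo_features, demo_weights, np.zeros(5))
print(demo_scores.shape)
```

Taking the maximum lets a class score be driven by whichever location and scale responds most strongly, so a small off-center object and a large centered one can both contribute to the frame-level representation.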

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization