A Multi-scale Multiple Instance Video Description Network
Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, Kate Saenko

TL;DR
This paper introduces a novel multi-scale, multi-instance neural network architecture that enhances video description generation by effectively capturing multiple objects of varying sizes and locations within videos.
Contribution
It presents the first end-to-end trainable multi-scale, multi-instance network integrated with sequence-to-sequence models for improved video captioning.
Findings
Outperforms single-scale CNN models on the YouTube video dataset
Efficiently handles multiple objects and scales in video frames
Demonstrates potential for extension to other video processing tasks
Abstract
Generating natural language descriptions for in-the-wild videos is a challenging task. Most state-of-the-art methods for solving this problem borrow existing deep convolutional neural network (CNN) architectures (AlexNet, GoogLeNet) to extract a visual representation of the input video. However, these deep CNN architectures are designed for single-label centered-positioned object classification. While they generate strong semantic features, they have no inherent structure allowing them to detect multiple objects of different sizes and locations in the frame. Our paper tries to solve this problem by integrating the base CNN into several fully convolutional neural networks (FCNs) to form a multi-scale network that handles multiple receptive field sizes in the original image. FCNs, previously applied to image segmentation, can generate class heat-maps efficiently compared to sliding window…
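The abstract describes applying a base CNN fully convolutionally so that it produces class heat-maps at several receptive field sizes. The following is a minimal sketch of that idea in plain Python. The function names (`class_heatmap`, `mil_max_pool`), the toy shapes, and the use of max pooling to aggregate over locations and scales are illustrative assumptions; the excerpt above does not give the authors' implementation details.

```python
# Illustrative sketch only: names, shapes, and the max-pooling aggregation
# are assumptions, not the paper's implementation.

def class_heatmap(feature_map, weights, bias):
    """Apply one linear classifier at every spatial location (a 1x1 convolution).

    feature_map: H x W grid of feature vectors (lists of floats).
    weights: one weight vector per class; bias: one float per class.
    Returns an H x W grid of per-class score lists.
    """
    return [
        [
            [sum(w * x for w, x in zip(w_c, cell)) + b_c
             for w_c, b_c in zip(weights, bias)]
            for cell in row
        ]
        for row in feature_map
    ]


def mil_max_pool(heatmaps):
    """Multiple-instance style aggregation: for each class, keep the highest
    score found at any location in any scale."""
    num_classes = len(heatmaps[0][0][0])
    return [
        max(cell[c] for hm in heatmaps for row in hm for cell in row)
        for c in range(num_classes)
    ]


# Toy example: 2-dimensional features, 2 classes, two hypothetical scales.
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

coarse = [[[0.2, 0.1]]]                       # 1 x 1 grid (large receptive field)
fine = [[[0.0, 0.9], [0.1, 0.0]],
        [[0.5, 0.0], [0.0, 0.3]]]             # 2 x 2 grid (small receptive field)

heatmaps = [class_heatmap(fm, weights, bias) for fm in (coarse, fine)]
frame_scores = mil_max_pool(heatmaps)
print(frame_scores)  # [0.5, 0.9]
```

In this toy case the fine scale supplies the strongest response for both classes, which is the situation a single centered, single-scale classifier would miss when an object is small or off-center.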
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
