Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification
Ricardo Montalvo-Lezama, Berenice Montalvo-Lezama, Gibran Fuentes-Pineda

TL;DR
This paper evaluates the transferability of image and video models to multi-label movie trailer genre classification and introduces DIViTA, a dual transformer architecture that segments trailers to improve transfer performance.
Contribution
The paper presents DIViTA, a novel dual transformer architecture with shot detection, enhancing transfer learning from ImageNet and Kinetics to movie trailer genre classification.
Findings
Transferability of ImageNet and Kinetics models to Trailers12k is comparable.
Combining features from models pretrained on both datasets improves classification accuracy.
Lightweight ConvNets nearly match top Transformer performance with fewer resources.
Abstract
In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually-curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. In order to reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection so as to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained…
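The abstract says DIViTA uses shot detection to split a trailer into highly correlated clips, but it does not say which detector is used. As a rough illustration of the idea only, the sketch below cuts a frame sequence wherever the mean absolute difference between consecutive frames exceeds a threshold. The function names, the flattened grayscale frame representation, and the `threshold` value are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def segment_into_shots(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.3,  # hypothetical cut threshold, not from the paper
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame-index ranges, one per detected shot.

    A simple frame-difference stand-in for a real shot detector: a new shot
    starts whenever two consecutive frames differ by more than `threshold`.
    """
    if not frames:
        return []
    shots: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            shots.append((start, i))  # close the current shot before frame i
            start = i
    shots.append((start, len(frames)))  # close the final shot
    return shots


# Toy "trailer": five 2-pixel grayscale frames with two abrupt changes.
toy_frames = [[0.0, 0.0], [0.05, 0.0], [1.0, 1.0], [0.95, 1.0], [0.0, 0.1]]
toy_shots = segment_into_shots(toy_frames)
```

Under these assumptions, each resulting range would be passed to a pretrained image or video backbone as one clip, so that the frames in a clip come from a single shot.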
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Multi-Head Attention · Label Smoothing · Absolute Position Encodings · Layer Normalization
