Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification

Ricardo Montalvo-Lezama; Berenice Montalvo-Lezama; Gibran Fuentes-Pineda

arXiv:2210.07983 · cs.CV · March 30, 2023

TL;DR

This paper evaluates the transferability of image and video models to multi-label movie trailer genre classification and introduces DIViTA, a dual transformer architecture that segments trailers to improve transfer performance.

Contribution

The paper presents DIViTA, a novel dual transformer architecture with shot detection, enhancing transfer learning from ImageNet and Kinetics to movie trailer genre classification.

Findings

01

Transferability of ImageNet and Kinetics models to Trailers12k is comparable.

02

Combining features from both datasets improves classification accuracy.

03

Lightweight ConvNets nearly match top Transformer performance with fewer resources.
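The second finding, combining image and video features for a multi-label task, can be illustrated with a minimal sketch. It assumes the simplest possible fusion (concatenation followed by one linear layer with per-genre sigmoid outputs); this is not confirmed to be the paper's fusion method. The feature sizes, the random features, and the function name `fuse_and_score` are hypothetical; only the number of genres (10) comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_GENRES = 10               # Trailers12k uses 10 genres (from the abstract)
IMAGE_DIM, VIDEO_DIM = 8, 6   # hypothetical feature sizes, for illustration only


def fuse_and_score(image_feat, video_feat, weight, bias):
    """Concatenate both feature vectors and return per-genre probabilities.

    Each genre gets its own sigmoid, so a trailer can belong to several
    genres at once (multi-label), unlike a softmax over genres.
    """
    fused = np.concatenate([image_feat, video_feat], axis=-1)
    logits = fused @ weight + bias
    return 1.0 / (1.0 + np.exp(-logits))


# Stand-ins for features from an ImageNet-pretrained and a Kinetics-pretrained
# backbone; real features would come from the pretrained models.
image_feat = rng.normal(size=(4, IMAGE_DIM))
video_feat = rng.normal(size=(4, VIDEO_DIM))
weight = rng.normal(scale=0.1, size=(IMAGE_DIM + VIDEO_DIM, NUM_GENRES))
bias = np.zeros(NUM_GENRES)

probs = fuse_and_score(image_feat, video_feat, weight, bias)
predicted = probs >= 0.5      # independent yes/no decision per genre
```

Because the sigmoids are independent, the probabilities in a row do not need to sum to one, which is what distinguishes this multi-label head from a single-label classifier.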

Abstract

In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually-curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. In order to reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection so as to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained…
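The idea of segmenting a trailer into shots before feeding clips to a pretrained model can be sketched as follows. This is a generic histogram-difference heuristic assumed for illustration, not the shot detector used by DIViTA; the function names, the bin count, and the threshold are all hypothetical.

```python
import numpy as np


def detect_shot_boundaries(frames, bins=16, threshold=0.5):
    """Return frame indices at which a new shot is assumed to start.

    frames: array of shape (T, H, W), grayscale values in [0, 255], T >= 1.
    A boundary is flagged when the intensity histograms of two consecutive
    frames differ by more than `threshold` (half the L1 distance, in [0, 1]).
    """
    frames = np.asarray(frames)
    hists = []
    for frame in frames:
        hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
        hists.append(hist / hist.sum())
    hists = np.stack(hists)
    dists = 0.5 * np.abs(hists[1:] - hists[:-1]).sum(axis=1)
    return [int(i) + 1 for i in np.flatnonzero(dists > threshold)]


def split_into_clips(num_frames, boundaries):
    """Turn boundary indices into (start, end) frame ranges, end exclusive."""
    edges = [0, *boundaries, num_frames]
    return list(zip(edges[:-1], edges[1:]))


# Toy "trailer": five dark frames followed by four bright frames.
dark = np.full((5, 8, 8), 10, dtype=np.uint8)
bright = np.full((4, 8, 8), 200, dtype=np.uint8)
frames = np.concatenate([dark, bright])

boundaries = detect_shot_boundaries(frames)
clips = split_into_clips(len(frames), boundaries)
```

Each resulting clip covers frames from a single shot, which is the kind of more cohesive input the abstract describes providing to pretrained backbones.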

Code & Models

Repositories

richardtml/divita (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition

Methods: Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Multi-Head Attention · Label Smoothing · Absolute Position Encodings · Layer Normalization