Movie Trailer Genre Classification Using Multimodal Pretrained Features
Serkan Sulun, Paula Viana, Matthew E. P. Davies

TL;DR
This paper presents a multimodal approach using pretrained models and transformer-based fusion to improve movie trailer genre classification, outperforming existing methods in accuracy and efficiency.
Contribution
The paper introduces a novel multimodal fusion method with pretrained features and transformer models that captures complex dependencies without temporal pooling.
Findings
Outperforms state-of-the-art models in genre classification metrics
Utilizes all video and audio frames without temporal pooling
Provides publicly available pretrained features, code, and models
Abstract
We introduce a novel method for movie genre classification, capitalizing on a diverse set of readily accessible pretrained models. These models extract high-level features related to visual scenery, objects, characters, text, speech, music, and audio effects. To intelligently fuse these pretrained features, we train small classifier models with low time and memory requirements. Employing the transformer model, our approach utilizes all video and audio frames of movie trailers without performing any temporal pooling, efficiently exploiting the correspondence between all elements, as opposed to the fixed and low number of frames typically used by traditional methods. Our approach fuses features originating from different tasks and modalities, with different dimensionalities, different temporal lengths, and complex dependencies as opposed to current approaches. Our method outperforms…
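The fusion idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' released code. It is an assumption-laden toy: the modality names, feature sizes, `D_MODEL`, `N_GENRES`, the learned classification token, the per-modality embeddings, and the sigmoid multi-label output are all hypothetical choices made for illustration. It shows only the core mechanism: features with different dimensionalities and temporal lengths are projected to a shared width, concatenated along time with every frame kept as its own token, and mixed by one self-attention layer. Positional encodings, layer normalization, multiple heads, and feed-forward blocks are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 16   # shared token width (hypothetical)
N_GENRES = 5   # number of genre labels (hypothetical)

# Hypothetical pretrained features, one array per modality, shaped
# (num_frames, feature_dim). Lengths and widths differ on purpose.
features = {
    "visual": rng.normal(size=(12, 32)),
    "audio": rng.normal(size=(30, 8)),
    "speech": rng.normal(size=(7, 24)),
}


def init_params(features, d_model, n_out, rng):
    """Randomly initialise the weights a trained model would learn."""
    params = {"proj": {}, "mod_emb": {}}
    for name, x in features.items():
        in_dim = x.shape[1]
        # Per-modality linear projection to the shared width.
        params["proj"][name] = rng.normal(scale=in_dim ** -0.5, size=(in_dim, d_model))
        # Per-modality embedding so attention can tell sources apart (assumption).
        params["mod_emb"][name] = rng.normal(scale=0.02, size=(d_model,))
    # Classification token used as the readout position (assumption).
    params["cls"] = rng.normal(scale=0.02, size=(1, d_model))
    for key in ("wq", "wk", "wv"):
        params[key] = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
    params["w_out"] = rng.normal(scale=d_model ** -0.5, size=(d_model, n_out))
    return params


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse(features, params):
    """Project, concatenate, and attend over all frames of all modalities."""
    tokens = [params["cls"]]
    for name, x in features.items():
        tokens.append(x @ params["proj"][name] + params["mod_emb"][name])
    # Every frame stays a separate token: shape (1 + total_frames, d_model).
    seq = np.concatenate(tokens, axis=0)

    # Single-head scaled dot-product self-attention with a residual connection.
    q, k, v = seq @ params["wq"], seq @ params["wk"], seq @ params["wv"]
    attn = softmax(q @ k.T / np.sqrt(seq.shape[1]))
    out = seq + attn @ v

    # Read genre scores from the classification token; sigmoid gives
    # independent per-genre probabilities (multi-label assumption).
    logits = out[0] @ params["w_out"]
    probs = 1.0 / (1.0 + np.exp(-logits))
    return seq, attn, probs


params = init_params(features, D_MODEL, N_GENRES, rng)
seq, attn, probs = fuse(features, params)
print(seq.shape, attn.shape, probs.shape)
```

Because the attention matrix spans the full concatenated sequence, each token can attend to any frame of any modality, which is the property the abstract contrasts with methods that sample a small, fixed number of frames.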
Taxonomy
Methods: Sparse Evolutionary Training
