Enhanced Movie Content Similarity Based on Textual, Auditory and Visual Information
Konstantinos Bougiatiotis, Theodore Giannakopoulos

TL;DR
This paper demonstrates that combining low-level textual, auditory, and visual features significantly improves movie similarity estimation for content-based recommendations, outperforming metadata-only methods.
Contribution
It introduces a multimodal feature extraction approach for movies, integrating textual, visual, and audio cues with metadata to enhance similarity measures.
Findings
Adding multimodal features improves recommendation accuracy by over 50% compared with metadata-only methods.
All three modalities contribute significantly to similarity estimation.
First comprehensive approach using combined low-level features across modalities.
Abstract
In this paper we examine the ability of low-level multimodal features to extract movie similarity, in the context of a content-based movie recommendation approach. In particular, we demonstrate the extraction of multimodal representation models of movies, based on textual information from subtitles, as well as cues from the audio and visual channels. With regard to the textual domain, we focus on topic modeling of movies based on their subtitles, in order to extract topics that discriminate between movies. Regarding the visual domain, we focus on the extraction of semantically useful features that model camera movements, colors and faces, while for the audio domain we adopt simple classification aggregates based on pretrained models. The three domains are combined with static metadata (e.g. directors, actors) to prove that the content-based movie similarity procedure…
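The idea of combining per-modality representations into a single movie similarity score can be illustrated with a minimal sketch. The excerpt above does not say how the modalities are fused, so the weighted average of per-modality cosine similarities used here is an assumption, not the authors' method. The function names (`cosine`, `fused_similarity`, `most_similar`), the equal weights, and the toy feature vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def fused_similarity(movie_a, movie_b, weights):
    """Weighted average of per-modality cosine similarities (assumed fusion rule)."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        w * cosine(movie_a[modality], movie_b[modality])
        for modality, w in weights.items()
    )
    return weighted_sum / total_weight


def most_similar(query, movies, weights):
    """Rank all other movies by fused similarity to the query, best first."""
    scores = {
        name: fused_similarity(movies[query], feats, weights)
        for name, feats in movies.items()
        if name != query
    }
    return sorted(scores, key=scores.get, reverse=True)


# Made-up feature vectors, one per modality. In practice these might be
# subtitle topic distributions, audio classifier aggregates, visual
# descriptors, and binary metadata indicators.
movies = {
    "A": {"text": [0.9, 0.1], "audio": [0.8, 0.2], "visual": [0.7, 0.3], "metadata": [1, 0]},
    "B": {"text": [0.8, 0.2], "audio": [0.7, 0.3], "visual": [0.6, 0.4], "metadata": [1, 0]},
    "C": {"text": [0.1, 0.9], "audio": [0.2, 0.8], "visual": [0.3, 0.7], "metadata": [0, 1]},
}

# Equal weights are an arbitrary starting point, not a value from the paper.
WEIGHTS = {"text": 1.0, "audio": 1.0, "visual": 1.0, "metadata": 1.0}

ranking = most_similar("A", movies, WEIGHTS)
print(ranking)
```

Setting a modality's weight to zero removes it from the score, which gives a simple way to compare a metadata-only baseline against the full multimodal combination on the same data.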
