Cosine Similarity of Multimodal Content Vectors for TV Programmes
Saba Nazir, Taner Cagali, Chris Newell, Mehrnoosh Sadrzadeh

TL;DR
This paper presents a multimodal content representation method for TV programmes using vector-based features from audiovisual, textual, and metadata sources, enhancing recommendation quality.
Contribution
It introduces a fusion approach combining spectral, textual, and categorical features to improve content similarity measures for TV programme recommendations.
Findings
Late fusion significantly improves recommendation precision.
Fused representations increase recommendation diversity.
Multimodal vectors effectively capture content semantics.
Abstract
Multimodal information originates from a variety of sources: audiovisual files, textual descriptions, and metadata. We show how one can represent the content encoded by each individual source using vectors, how to combine the vectors via middle and late fusion techniques, and how to compute the semantic similarities between the contents. Our vectorial representations are built from spectral features and Bags of Audio Words for audio, LSI topics and Doc2vec embeddings for subtitles, and categorical features for metadata. We implement our model on a dataset of BBC TV programmes and evaluate the fused representations by using them to provide recommendations. The late-fused similarity matrices significantly improve the precision and diversity of recommendations.
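The abstract does not spell out the fusion formulas, so the following is only a minimal sketch of how the two strategies are commonly set up, not the authors' implementation. It assumes that middle fusion concatenates the per-modality vectors of each programme before computing cosine similarity, and that late fusion takes an equally weighted average of the per-modality similarity matrices. The function names, the weights, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities between all programmes of one modality."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def middle_fusion(*modalities):
    """Assumed middle fusion: concatenate each programme's vectors across modalities."""
    return [[x for vec in vecs for x in vec] for vecs in zip(*modalities)]


def late_fusion(matrices, weights=None):
    """Assumed late fusion: weighted average of per-modality similarity matrices."""
    if weights is None:
        weights = [1.0 / len(matrices)] * len(matrices)  # equal weights (assumption)
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


# Toy data for three programmes (made-up values, not from the BBC dataset).
audio = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]                 # e.g. Bag of Audio Words
text = [[0.2, 0.8, 0.0], [0.1, 0.9, 0.0], [0.7, 0.0, 0.3]]   # e.g. LSI topics
meta = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]                  # e.g. one-hot genre

middle = similarity_matrix(middle_fusion(audio, text, meta))
late = late_fusion([similarity_matrix(m) for m in (audio, text, meta)])
```

In this toy example, programmes 0 and 1 are close in every modality, so both `middle[0][1]` and `late[0][1]` come out much higher than the corresponding scores against programme 2. A recommender would then rank candidates for a given programme by its row in the fused matrix.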
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Natural Language Processing Techniques
