Learning Multi-modal Similarity
Brian McFee, Gert Lanckriet

TL;DR
This paper introduces a novel multiple kernel learning method that integrates heterogeneous multi-media data into a unified similarity space, leveraging human perceptual similarity measurements and graph-based filtering for robustness.
Contribution
It proposes a new kernel ensemble technique for multi-modal data integration that aligns with human perceptual similarity and handles measurement subjectivity.
Findings
Effective integration of multi-modal data into a single similarity space.
Robustness achieved through graph-based filtering of similarity measurements.
Improved performance in multimedia retrieval tasks.
Abstract
In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, e.g., nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multi-media similarity, we develop graph-based techniques to filter…
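The ideas named in the abstract (kernel-induced similarity, relative comparisons, and graph-based consistency checks) can be illustrated with a small sketch. It is not the authors' algorithm: the abstract is truncated and does not specify the learning procedure or the filtering rule. The sketch assumes a fixed non-negative weighted sum of kernels as a stand-in for the learned ensemble, assumes each comparison is a tuple (i, j, k, l) meaning "i and j are more similar than k and l", and assumes that inconsistency shows up as a cycle in a directed graph over item pairs. All function names are hypothetical.

```python
import numpy as np

def combine_kernels(kernels, weights):
    """Non-negative weighted sum of kernel matrices.

    Assumption: a fixed weighted sum stands in for the learned
    ensemble of kernel transformations described in the abstract.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative to keep a valid kernel")
    return sum(w * K for w, K in zip(weights, kernels))

def kernel_distance_sq(K):
    """Squared distances induced by a kernel: d(i,j)^2 = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

def satisfied(D2, comparisons):
    """For each (i, j, k, l), report whether d(i,j) < d(k,l) holds."""
    return [bool(D2[i, j] < D2[k, l]) for (i, j, k, l) in comparisons]

def has_cycle(comparisons):
    """Detect contradictory comparisons.

    Assumption: each unordered item pair is a vertex, and a comparison
    (i, j, k, l) adds the edge {i,j} -> {k,l}. A directed cycle means
    no distance function can satisfy every comparison at once.
    """
    graph = {}
    for i, j, k, l in comparisons:
        a, b = frozenset((i, j)), frozenset((k, l))
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())

    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def visit(u):
        state[u] = 1
        for v in graph[u]:
            if state.get(v) == 1:
                return True
            if state.get(v) is None and visit(v):
                return True
        state[u] = 2
        return False

    return any(state.get(u) is None and visit(u) for u in graph)

# Toy data: three items described by two hypothetical modalities.
X_audio = np.array([[0.0], [1.0], [3.0]])
X_video = np.array([[0.0], [2.0], [2.0]])
K = combine_kernels([X_audio @ X_audio.T, X_video @ X_video.T], [1.0, 0.5])
D2 = kernel_distance_sq(K)
```

With this toy data, `satisfied(D2, [(0, 1, 0, 2)])` checks whether items 0 and 1 are closer than items 0 and 2 in the combined space, and `has_cycle` flags comparison sets that contradict each other before any learning is attempted.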
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing · Advanced Image and Video Retrieval Techniques
