Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text
Ayush Jaiswal, Ekraam Sabir, Wael AbdAlmageed, Premkumar Natarajan

TL;DR
This paper introduces a deep learning approach that jointly embeds images and captions to assess the semantic integrity of multimedia packages against a reference set, and reports high F1 scores for detecting manipulated packages on multiple datasets.
Contribution
It presents a novel multimodal representation learning framework and a new dataset for evaluating multimedia integrity verification methods.
Findings
Achieves high F1 scores on multiple datasets for detecting incoherent media packages.
Develops a joint embedding framework for image-caption consistency assessment.
Provides a new dataset (MAIM) for multimedia integrity research.
Abstract
Real-world multimedia data is often composed of multiple modalities, such as an image or a video with associated text (e.g., captions and user comments) and metadata. Such multimodal data packages are prone to manipulations, where a subset of these modalities can be altered to misrepresent or repurpose data packages, with possible malicious intent. It is, therefore, important to develop methods to assess or verify the integrity of these multimedia packages. Using computer vision and natural language processing methods to directly compare the image (or video) and the associated caption to verify the integrity of a media package is only possible for a limited set of objects and scenes. In this paper, we present a novel deep learning-based approach for assessing the semantic integrity of multimedia packages containing images and captions, using a reference set of multimedia packages. We…
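The idea of checking a package against a reference set can be illustrated with a minimal sketch. It assumes that image and caption vectors have already been mapped into a shared embedding space by some encoder (not shown), and it uses a simple cosine-similarity threshold as a stand-in scoring rule; this is not the paper's scoring procedure. The function names (cosine_similarity, fit_threshold, is_consistent) and the quantile parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n, d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def fit_threshold(ref_img: np.ndarray, ref_txt: np.ndarray,
                  quantile: float = 0.05) -> float:
    """Derive a similarity threshold from trusted reference packages.

    Assumption: reference packages are unmanipulated, so a low quantile
    of their image-caption similarities marks the edge of "normal".
    """
    scores = cosine_similarity(ref_img, ref_txt)
    return float(np.quantile(scores, quantile))


def is_consistent(img_vec: np.ndarray, txt_vec: np.ndarray,
                  threshold: float) -> bool:
    """Flag a query package as consistent if its score meets the threshold."""
    score = cosine_similarity(img_vec[None, :], txt_vec[None, :])[0]
    return bool(score >= threshold)


if __name__ == "__main__":
    # Synthetic stand-ins for embeddings; real ones would come from a trained model.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(200, 64))
    ref_img = base + 0.1 * rng.normal(size=base.shape)
    ref_txt = base + 0.1 * rng.normal(size=base.shape)
    threshold = fit_threshold(ref_img, ref_txt)

    query_img = base[0]
    swapped_caption = rng.normal(size=64)  # caption unrelated to the image
    print("matching caption:", is_consistent(query_img, base[0], threshold))
    print("swapped caption:", is_consistent(query_img, swapped_caption, threshold))
```

In this sketch the reference set only calibrates the threshold; a more faithful system would learn the embedding itself from the reference packages and could replace the threshold with any outlier-scoring rule.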
