EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata
Chenhao Zheng, Ayush Shrivastava, Andrew Owens

TL;DR
This paper introduces a multimodal embedding model that learns to associate image patches with camera metadata, enabling improved performance on image forensics and calibration tasks, including zero-shot splicing localization.
Contribution
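The paper presents an approach for learning cross-modal associations between image patches and EXIF metadata: the metadata is converted to text and processed with a transformer, and the resulting visual features outperform other self-supervised and supervised features on downstream forensics and calibration tasks.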
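To make the idea of a cross-modal embedding concrete, here is a minimal sketch of one common way to train paired image and text encoders: a symmetric contrastive loss over a batch. This summary does not state the training objective, so the contrastive formulation, the function names, and the temperature value of 0.07 are all assumptions for illustration, not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss (assumed objective, hypothetical helper).

    image_emb[i] and text_emb[i] are embeddings of a patch and the EXIF
    text from the same photo; every other pairing in the batch is a negative.
    """
    img = [normalize(v) for v in image_emb]
    txt = [normalize(v) for v in text_emb]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch of two pairs: aligned pairs should score a lower loss than swapped ones.
patches = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(patches, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(patches, [[0.0, 1.0], [1.0, 0.0]])
```

In a real system the embeddings would come from an image encoder and a text transformer; here they are hand-written vectors so the loss behavior is easy to check.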
Findings
The learned features significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks.
Spliced image regions can be localized zero-shot by clustering the visual embeddings of all patches within an image.
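The zero-shot localization finding can be sketched as follows, assuming per-patch embeddings are already available. The paper's summary says only that patch embeddings are clustered; the two-cluster k-means used here is a simple stand-in, and treating the smaller cluster as the spliced region is an additional assumption. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def two_means(embeddings, iters=20):
    """Split patch embeddings into two clusters; returns one 0/1 label per patch."""
    c0 = embeddings[0]
    # Deterministic init: second center is the patch farthest from the first.
    c1 = max(embeddings, key=lambda e: sq_dist(e, c0))
    labels = []
    for _ in range(iters):
        new_labels = [0 if sq_dist(e, c0) <= sq_dist(e, c1) else 1
                      for e in embeddings]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        group0 = [e for e, l in zip(embeddings, labels) if l == 0]
        group1 = [e for e, l in zip(embeddings, labels) if l == 1]
        if group0:
            c0 = mean(group0)
        if group1:
            c1 = mean(group1)
    return labels

def splice_mask(embeddings):
    """Flag the minority cluster as the candidate spliced region (assumption)."""
    labels = two_means(embeddings)
    minority = 1 if sum(labels) * 2 <= len(labels) else 0
    return [l == minority for l in labels]

# Toy image: four patches from one "camera", two from another.
patch_embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.0],
                    [0.0, 1.0], [0.1, 0.9]]
mask = splice_mask(patch_embeddings)
```

With identical embeddings for every patch, the second cluster stays empty and no patch is flagged, which matches the intuition that an unedited photo has a single consistent camera signature.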
Abstract
We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features that we learn significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks. In particular, we successfully localize spliced image regions "zero shot" by clustering the visual embeddings for all of the patches within an image.
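As an illustration of "simply converting it to text", the sketch below serializes a dictionary of EXIF tags into a single string that a text transformer could consume. The exact text format is not given in the abstract, so the "Tag: value" layout, the helper name, and the sample values are assumptions.

```python
def exif_to_text(exif: dict) -> str:
    """Serialize EXIF tags as 'Tag: value' pairs joined into one string.

    Tags with missing values are skipped (an assumed convention).
    """
    parts = [f"{tag}: {value}" for tag, value in exif.items() if value is not None]
    return ", ".join(parts)

# Illustrative metadata; the values are made up for the example.
example = {"Make": "Canon", "Model": "EOS 5D", "FocalLength": "50.0 mm", "Flash": None}
caption = exif_to_text(example)
print(caption)  # Make: Canon, Model: EOS 5D, FocalLength: 50.0 mm
```

Treating metadata as free text avoids designing a separate encoder for each tag type, since numeric and categorical fields all pass through the same tokenizer.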
Taxonomy
Topics: Digital Media Forensic Detection · Generative Adversarial Networks and Image Synthesis · Anomaly Detection Techniques and Applications
