Linking Representations with Multimodal Contrastive Learning

Abhishek Arora; Xinmei Yang; Shao-Yu Jheng; Melissa Dell

arXiv:2304.03464 · cs.CV · June 25, 2024 · 1 citation

TL;DR

This paper introduces CLIPPINGS, a multimodal contrastive learning approach that aligns vision and language embeddings to improve record linkage, especially in noisy OCR scenarios, outperforming traditional string matching methods.

Contribution

The study develops a multimodal contrastive learning framework that uses both image crops and OCR'ed text for record linkage, and reports performance gains over traditional string matching methods.

Findings

01. CLIPPINGS outperforms string matching in linking historical Japanese firms.

02. Self-supervised multimodal models surpass traditional string matching techniques.

03. Multimodal pre-training enhances vision-only encoder performance even with single modality at inference.

Abstract

Many applications require linking individuals, firms, or locations across datasets. Most widely used methods, especially in social science, do not employ deep learning, with record linkage commonly approached using string matching techniques. Moreover, existing methods do not exploit the inherently multimodal nature of documents. In historical record linkage applications, documents are typically noisily transcribed by optical character recognition (OCR). Linkage with just OCR'ed texts may fail due to noise, whereas linkage with just image crops may also fail because vision models lack language understanding (e.g., of abbreviations or other different ways of writing firm names). To leverage multimodal learning, this study develops CLIPPINGS (Contrastively LInking Pooled Pre-trained Embeddings). CLIPPINGS aligns symmetric vision and language bi-encoders, through contrastive language-image…
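
The abstract is truncated above, so the following is only a minimal sketch of the general idea the name suggests: pool an image embedding and a text embedding for each record, then link records by nearest-neighbor cosine similarity. The pooling rule (a plain average of unit-length vectors), the function names `pool` and `link`, and the toy three-dimensional vectors are all assumptions made for illustration, not the authors' implementation. Real embeddings would come from trained vision and language encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def pool(image_emb, text_emb):
    """Average the unit-length image and text embeddings, then renormalize.

    Assumption: mean pooling. The source does not spell out the pooling rule.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    return l2_normalize([(a + b) / 2.0 for a, b in zip(img, txt)])


def cosine(u, v):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(a * b for a, b in zip(u, v))


def link(query, candidates):
    """Return the candidate key whose pooled embedding is closest to the query."""
    best_key, best_score = None, -2.0  # cosine similarity is never below -1
    for key, emb in candidates.items():
        score = cosine(query, emb)
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score


# Hypothetical toy embeddings standing in for encoder outputs.
candidates = {
    "firm_a": pool([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    "firm_b": pool([0.0, 1.0, 0.0], [0.1, 0.9, 0.0]),
}

# The query's text embedding is deliberately off (simulating OCR noise),
# while its image embedding still resembles firm_a.
query = pool([0.8, 0.2, 0.0], [0.0, 0.0, 1.0])
match, score = link(query, candidates)
print(match, round(score, 3))
```

In this toy case the pooled query still lands nearest the intended candidate even though its text half carries no useful signal, which is the intuition behind combining the two modalities rather than relying on either one alone.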

Taxonomy

Topics: Topic Modeling · Handwritten Text Recognition Techniques · Digital and Traditional Archives Management