Who's Waldo? Linking People Across Text and Images

Claire Yuqing Cui; Apoorv Khandelwal; Yoav Artzi; Noah Snavely; Hadar Averbuch-Elor

arXiv:2108.07253 · cs.CV · August 18, 2021

TL;DR

This paper introduces a new task and dataset for linking people named in a caption to the people pictured in the corresponding image, emphasizing contextual cues over appearance, and proposes a Transformer-based method that outperforms several strong baselines.

Contribution

The paper presents a novel person-centric visual grounding task, a new dataset called Who's Waldo mined automatically from image-caption data on Wikimedia Commons, and a Transformer-based approach that outperforms several strong baselines on the task.

Findings

01 Transformer-based method outperforms baselines

02 Dataset enables focus on contextual cues

03 Benchmark facilitates future research in visual grounding

Abstract

We present a task and benchmark dataset for person-centric visual grounding, the problem of linking between people named in a caption and people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Who's Waldo, mined automatically from image-caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and are releasing our data to the research community to spur work on contextual models that consider both vision and language.
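
The name-masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' preprocessing code: the placeholder token `[NAME]`, the helper `mask_names`, and the example caption are all hypothetical, and the character spans are assumed to come from an upstream name-detection step that is not shown.

```python
from typing import List, Tuple

# Hypothetical placeholder token; the dataset's actual token may differ.
MASK_TOKEN = "[NAME]"


def mask_names(caption: str, name_spans: List[Tuple[int, int]]) -> str:
    """Replace each (start, end) character span in `caption` with MASK_TOKEN.

    Spans are assumed to be non-overlapping and to come from an upstream
    name-detection step that is not shown here.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(name_spans):
        pieces.append(caption[cursor:start])  # text before the name
        pieces.append(MASK_TOKEN)             # the name itself is hidden
        cursor = end
    pieces.append(caption[cursor:])           # remaining text after the last name
    return "".join(pieces)


# Made-up caption for illustration only.
caption = "Alice Smith shakes hands with Bob Jones."
spans = [(0, 11), (30, 39)]
masked = mask_names(caption, spans)
print(masked)
```

After masking, a model can no longer rely on associations between a name and an appearance; it has to use the remaining context, such as who is shaking hands with whom, to decide which pictured person each placeholder refers to.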

Code & Models

Repositories

clairecyq/whos-waldo (PyTorch, official)
