LocCa: Visual Pretraining with Location-aware Captioners

Bo Wan; Michael Tschannen; Yongqin Xian; Filip Pavetic; Ibrahim Alabdulmohsin; Xiao Wang; André Susano Pinto; Andreas Steiner; Lucas Beyer; Xiaohua Zhai

arXiv:2403.19596 · cs.CV · November 13, 2024 · 1 citation

TL;DR

LocCa introduces a location-aware captioning pretraining method that enhances localization capabilities in image understanding models, outperforming standard captioners on localization tasks while maintaining overall performance.

Contribution

The paper presents a pretraining approach built on location-aware captioners. It relies on the multitask capabilities of an encoder-decoder architecture to improve localization while keeping the simple image captioning interface.

Findings

01 Outperforms standard captioners on localization tasks

02 Maintains comparable performance on holistic image captioning tasks

03 Demonstrates effectiveness of location-aware pretraining

Abstract

Image captioning has been shown to be an effective pretraining method, similar to contrastive pretraining. However, the incorporation of location-aware information into visual pretraining remains an area with limited research. In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa). LocCa uses a simple image captioner task interface to teach a model to read out rich information, i.e., bounding box coordinates and captions, conditioned on the image pixel input. Thanks to the multitask capabilities of an encoder-decoder architecture, we show that an image captioner can easily handle multiple tasks during pretraining. Our experiments demonstrate that LocCa significantly outperforms standard captioners on localization downstream tasks while maintaining comparable performance on holistic tasks.
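The idea of reading out bounding box coordinates and captions through one captioner interface can be illustrated with a minimal sketch of how multitask text targets might be built. This is not the authors' implementation: the bin count `NUM_BINS`, the `(ymin, xmin, ymax, xmax)` box order, the task prefixes `caption:`, `locate:` and `describe:`, and the helpers `quantize`, `box_to_text` and `make_targets` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_BINS = 1000  # assumption: number of discrete bins per coordinate


@dataclass
class Region:
    # Assumed box order: (ymin, xmin, ymax, xmax), each normalized to [0, 1].
    box: Tuple[float, float, float, float]
    caption: str


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin in [0, num_bins - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * num_bins), num_bins - 1)


def box_to_text(box: Tuple[float, float, float, float],
                num_bins: int = NUM_BINS) -> str:
    """Render a box as space-separated integer tokens a text decoder can emit."""
    return " ".join(str(quantize(v, num_bins)) for v in box)


def make_targets(image_caption: str, region: Region) -> List[Tuple[str, str]]:
    """Build (decoder prefix, decoder target) pairs for one image.

    Hypothetical task set: whole-image captioning, predicting a box from a
    region description, and predicting a region description from a box.
    """
    coords = box_to_text(region.box)
    return [
        ("caption:", image_caption),
        (f"locate: {region.caption}", coords),
        (f"describe: {coords}", region.caption),
    ]


if __name__ == "__main__":
    region = Region(box=(0.25, 0.5, 0.75, 1.0), caption="brown dog")
    for prefix, target in make_targets("a dog on a beach", region):
        print(f"{prefix!r} -> {target!r}")
```

Because every task shares the same text-in, text-out form, a single encoder-decoder can be trained on all of them with the usual next-token loss, which is the property the abstract attributes to the captioner interface.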

Code & Models

Repositories

google-research/big_vision
JAX · Official

Videos

LocCa: Visual Pretraining with Location-aware Captioners · SlidesLive

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization