LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision

Zhijian Liu; Simon Stent; Jie Li; John Gideon; Song Han

arXiv:2108.11950·cs.CV·August 27, 2021

Open Access

TL;DR

LocTex leverages low-cost localized textual annotations like captions and mouse traces to pre-train visual models, reducing annotation effort while maintaining or improving performance on various vision tasks.

Contribution

The paper introduces LocTex, a contrastive pre-training framework that uses captions and mouse traces for effective, data-efficient visual representation learning.

Findings

01. Reduces the pre-training dataset size by 10x, or the target dataset size by 2x, relative to ImageNet supervised pre-training while keeping performance comparable

02. Achieves comparable or better results than ImageNet pre-training on COCO

03. Outperforms previous vision+language pre-training by 4% on PASCAL VOC

Abstract

Computer vision tasks such as object detection and semantic/instance segmentation rely on the painstaking annotation of large training datasets. In this paper, we propose LocTex that takes advantage of the low-cost localized textual annotations (i.e., captions and synchronized mouse-over gestures) to reduce the annotation effort. We introduce a contrastive pre-training framework between images and captions and propose to supervise the cross-modal attention map with rendered mouse traces to provide coarse localization signals. Our learned visual features capture rich semantics (from free-form captions) and accurate localization (from mouse traces), which are very effective when transferred to various downstream vision tasks. Compared with ImageNet supervised pre-training, LocTex can reduce the size of the pre-training dataset by 10x or the target dataset by 2x while achieving comparable…
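The two training signals named in the abstract, image-caption contrast and attention supervision from rendered mouse traces, can be illustrated with a small standard-library sketch. This is not the authors' implementation. The symmetric InfoNCE form, the temperature value, the point-wise rasterization of the trace, and the binary cross-entropy localization loss are all assumptions chosen for illustration, and the function names (`contrastive_loss`, `render_trace`, `localization_loss`) are hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (a zero vector is left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_feats, text_feats, temperature=0.1):
    # Assumed symmetric InfoNCE: pair (i, i) is the positive,
    # every other caption/image in the batch is a negative.
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    n = len(img)
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
            for j in range(n)] for i in range(n)]
    image_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)

def render_trace(points, height, width):
    # Rasterize trace points, given as (x, y) in [0, 1], into a binary mask.
    mask = [[0.0] * width for _ in range(height)]
    for x, y in points:
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        mask[row][col] = 1.0
    return mask

def localization_loss(attention, mask, eps=1e-7):
    # Assumed loss: mean binary cross-entropy between an attention map
    # with values in [0, 1] and the rendered trace mask.
    total, count = 0.0, 0
    for a_row, m_row in zip(attention, mask):
        for a, m in zip(a_row, m_row):
            a = min(max(a, eps), 1.0 - eps)
            total -= m * math.log(a) + (1.0 - m) * math.log(1.0 - a)
            count += 1
    return total / count

# Toy usage: matched pairs should score a lower contrastive loss than
# mismatched ones, and attention that follows the trace should score a
# lower localization loss than attention that avoids it.
aligned = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
mask = render_trace([(0.1, 0.1), (0.9, 0.9)], height=2, width=2)
on_trace = localization_loss(mask, mask)
off_trace = localization_loss([[0.0, 1.0], [1.0, 0.0]], mask)
```

In a full pipeline the two terms would presumably be combined into a single pre-training objective, but the weighting between them is not stated in the text above.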


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning