LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision
Zhijian Liu, Simon Stent, Jie Li, John Gideon, Song Han

TL;DR
LocTex leverages low-cost localized textual annotations like captions and mouse traces to pre-train visual models, reducing annotation effort while maintaining or improving performance on various vision tasks.
Contribution
The paper introduces LocTex, a contrastive pre-training framework that uses captions and mouse traces for effective, data-efficient visual representation learning.
Findings
Reduces pre-training dataset size by 10x (or target dataset size by 2x) relative to ImageNet supervised pre-training, with comparable performance
Achieves comparable or better results than ImageNet pre-training on COCO
Outperforms previous vision+language pre-training by 4% on PASCAL VOC
Abstract
Computer vision tasks such as object detection and semantic/instance segmentation rely on the painstaking annotation of large training datasets. In this paper, we propose LocTex that takes advantage of the low-cost localized textual annotations (i.e., captions and synchronized mouse-over gestures) to reduce the annotation effort. We introduce a contrastive pre-training framework between images and captions and propose to supervise the cross-modal attention map with rendered mouse traces to provide coarse localization signals. Our learned visual features capture rich semantics (from free-form captions) and accurate localization (from mouse traces), which are very effective when transferred to various downstream vision tasks. Compared with ImageNet supervised pre-training, LocTex can reduce the size of the pre-training dataset by 10x or the target dataset by 2x while achieving comparable…
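The two training signals described in the abstract (an image-caption contrastive objective, plus supervision of the cross-modal attention map by rendered mouse traces) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The symmetric InfoNCE form, the temperature value, the grid size, the cross-entropy form of the localization term, and all function names (`contrastive_loss`, `render_trace`, `localization_loss`) are assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of matched (image, caption) pairs.

    Assumption: row i of image_emb is paired with row i of text_emb.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (B, B); diagonal holds matched pairs
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float(0.5 * (loss_i2t + loss_t2i))


def render_trace(points, grid=(7, 7)):
    """Rasterize normalized (x, y) trace points in [0, 1] into a binary mask.

    Assumption: a coarse grid matching the attention-map resolution.
    """
    h, w = grid
    mask = np.zeros((h, w), dtype=np.float64)
    for x, y in points:
        row = min(int(y * h), h - 1)
        col = min(int(x * w), w - 1)
        mask[row, col] = 1.0
    return mask


def localization_loss(attention, mask, eps=1e-8):
    """Cross-entropy between an attention map (summing to 1) and the
    normalized trace mask. Lower when attention falls on traced cells."""
    target = mask / (mask.sum() + eps)
    return float(-(target * np.log(attention + eps)).sum())


if __name__ == "__main__":
    # Toy batch: four perfectly aligned image/caption embeddings.
    emb = np.eye(4)
    print("aligned contrastive loss:", contrastive_loss(emb, emb))

    # Toy trace covering the image centre on a 4x4 grid.
    mask = render_trace([(0.5, 0.5)], grid=(4, 4))
    uniform = np.full((4, 4), 1.0 / 16.0)
    print("uniform attention loss:", localization_loss(uniform, mask))
    print("focused attention loss:", localization_loss(mask, mask))
```

In a sketch like this, the total pre-training objective would be a weighted sum of the two terms; the weighting is likewise an assumption and is not specified in the text above.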
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
