VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text
Tianyu Zhang, Suyuchen Wang, Lu Li, Ge Zhang, Perouz Taslakian, Sai Rajeswar, Jie Fu, Bang Liu, Yoshua Bengio

TL;DR
This paper introduces VCR, a new vision-language task requiring models to restore obscured text in images using pixel-level cues, highlighting the challenge of integrating visual and textual information beyond traditional OCR methods.
Contribution
The paper presents a novel pixel-level reasoning task and a synthetic dataset, VCR-Wiki, with 2.11M images, and demonstrates the limitations of current models on this text restoration challenge.
Findings
Current models lag behind human performance in VCR.
Fine-tuning on the dataset does not significantly improve results.
The dataset enables future research in pixel-level reasoning for vision-language models.
Abstract
We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny exposed areas of masked texts. We develop a…
Taxonomy
Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Advanced Vision and Imaging
Methods: ALIGN
