VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text

Tianyu Zhang; Suyuchen Wang; Lu Li; Ge Zhang; Perouz Taslakian; Sai Rajeswar; Jie Fu; Bang Liu; Yoshua Bengio

arXiv:2406.06462 · cs.CV · April 21, 2025 · 1 citation

TL;DR

This paper introduces VCR, a new vision-language task requiring models to restore obscured text in images using pixel-level cues, highlighting the challenge of integrating visual and textual information beyond traditional OCR methods.

Contribution

The paper presents a novel pixel-level reasoning task and a synthetic dataset, VCR-Wiki, with 2.11M images, and demonstrates the limitations of current models on this complex text restoration challenge.

Findings

01 Current models lag behind human performance in VCR.

02 Fine-tuning on the dataset does not significantly improve results.

03 The dataset enables future research in pixel-level reasoning for vision-language models.

Abstract

We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny exposed areas of masked texts. We develop a…
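
To make the idea of "tiny exposed areas of masked texts" concrete, here is a minimal sketch of one way a text line could be partially occluded while leaving a thin strip of pixels visible at its top and bottom edges. This is an illustration only, not the authors' data pipeline: the function name `occlude_text_band`, the grayscale list-of-rows image representation, and the `exposed` and `fill` parameters are all assumptions made for this example.

```python
from typing import List

Image = List[List[int]]  # grayscale pixels: 0 = black, 255 = white


def occlude_text_band(image: Image, top: int, bottom: int,
                      exposed: int = 1, fill: int = 255) -> Image:
    """Cover rows top..bottom-1 with `fill`, keeping `exposed` rows visible
    at each edge of the band. Returns a new image; the input is not modified.
    """
    if not 0 <= top < bottom <= len(image):
        raise ValueError("band must lie inside the image")
    if exposed < 0 or 2 * exposed >= bottom - top:
        raise ValueError("exposed rows must leave something to cover")
    result = [row[:] for row in image]  # copy each row
    for y in range(top + exposed, bottom - exposed):
        result[y] = [fill] * len(result[y])
    return result


# Toy 6x4 "text line": every row contains dark glyph pixels.
toy = [[0, 255, 0, 255] for _ in range(6)]
masked = occlude_text_band(toy, top=0, bottom=6, exposed=1)
```

In this toy case only the first and last rows of the band keep their original pixels; a model restoring the text would have to rely on those thin strips plus the surrounding image and context.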

Code & Models

Repositories

tianyu-z/vcr (PyTorch, official)

Videos

VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text · SlidesLive

Taxonomy

Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Advanced Vision and Imaging

Methods: ALIGN