ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Xingyu Fu; Minqian Liu; Zhengyuan Yang; John Corring; Yijuan Lu; Jianwei Yang; Dan Roth; Dinei Florencio; Cha Zhang

arXiv:2501.05452·cs.CV·January 10, 2025

TL;DR

ReFocus is a framework that lets multimodal large language models perform visual editing on an input image through code, improving their visual reasoning on structured image understanding tasks such as interpreting tables and charts.

Contribution

The paper presents ReFocus, a method that lets multimodal LLMs generate visual edits via code, which improves reasoning over structured images without adding extra information.

Findings

01. ReFocus improves task performance by 11.0% on tables and 6.8% on charts.

02. Visual chain-of-thought supervision outperforms standard QA data.

03. ReFocus's visual editing enhances reasoning without extra information.

Abstract

Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python codes to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment upon a wide range of structured image understanding tasks involving…
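The three editing operations named in the abstract (drawing boxes, highlighting sections, masking out areas) can be illustrated with a minimal sketch. This is not the paper's tool API: the function names `draw_box`, `highlight`, and `mask_out`, the box convention, and the nested-list pixel grid standing in for a real image are all assumptions made for illustration. Each edit returns a new image, so a sequence of edits leaves a trace of intermediate "visual thoughts".

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]          # (R, G, B), each 0-255
Image = List[List[Pixel]]             # rows of pixels; a toy stand-in for a real image
Box = Tuple[int, int, int, int]       # (left, top, right, bottom), right/bottom exclusive


def new_image(width: int, height: int, color: Pixel = (255, 255, 255)) -> Image:
    """Create a solid-color image of the given size."""
    return [[color for _ in range(width)] for _ in range(height)]


def _clip(image: Image, box: Box) -> Box:
    """Clamp a box to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    return max(0, left), max(0, top), min(width, right), min(height, bottom)


def mask_out(image: Image, box: Box, fill: Pixel = (255, 255, 255)) -> Image:
    """Return a copy with the box region overwritten by a solid fill."""
    left, top, right, bottom = _clip(image, box)
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = fill
    return out


def draw_box(image: Image, box: Box, color: Pixel = (255, 0, 0)) -> Image:
    """Return a copy with a one-pixel outline drawn along the box border."""
    left, top, right, bottom = _clip(image, box)
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            if y in (top, bottom - 1) or x in (left, right - 1):
                out[y][x] = color
    return out


def highlight(image: Image, box: Box, tint: Pixel = (255, 255, 0),
              alpha: float = 0.5) -> Image:
    """Return a copy with the box region blended toward a tint color."""
    left, top, right, bottom = _clip(image, box)
    out = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = tuple(
                int((1 - alpha) * c + alpha * t) for c, t in zip(out[y][x], tint)
            )
    return out


# A hypothetical editing sequence: hide irrelevant columns, box the region
# of interest, then highlight the cell that answers the question.
image = new_image(8, 6, color=(100, 100, 100))
step1 = mask_out(image, (4, 0, 8, 6))
step2 = draw_box(step1, (0, 0, 4, 3))
step3 = highlight(step2, (1, 1, 3, 2), tint=(200, 200, 0))
```

In a real pipeline the model would emit calls like these against an actual image library, and each edited image would be fed back to the model as the next reasoning step; that loop is outside the scope of this sketch.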

Code & Models

Repositories

VTOOL-R1/vtool-r1
pytorch

Models

ReFocus/Trained_Model
model · ♡ 1

Datasets

ReFocus/ReFocus_Data
dataset · 704 dl

Videos

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding · slideslive

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training