ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang

TL;DR
ReFocus is a framework that lets multimodal large language models edit the input image through code, which strengthens their visual reasoning and improves performance on structured image understanding tasks such as interpreting tables and charts.
Contribution
The paper presents ReFocus, a method that lets multimodal LLMs generate visual edits via code, improving their reasoning over structured images without adding extra information.
Findings
ReFocus improves task performance by an average of 11.0% on table tasks and 6.8% on chart tasks.
Visual chain-of-thought supervision outperforms standard QA data.
ReFocus's visual editing enhances reasoning without extra information.
Abstract
Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focus. Specifically, ReFocus enables multimodal LLMs to generate Python code that calls tools and modifies the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving…
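The three edit operations named in the abstract (drawing boxes, highlighting sections, masking out areas) can be illustrated with a small sketch. This is not the paper's tool API: the function names `draw_box`, `highlight`, and `mask_out`, the list-of-pixels image representation, and the box coordinates are all assumptions chosen to keep the example dependency-free. In practice a model-generated program would likely call an imaging library on a real table or chart image.

```python
from typing import Callable, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def blank_image(width: int, height: int, color: Pixel = (200, 200, 200)) -> Image:
    """Stand-in for a loaded table or chart image (hypothetical helper)."""
    return [[color for _ in range(width)] for _ in range(height)]


def mask_out(img: Image, box: Box, fill: Pixel = (255, 255, 255)) -> Image:
    """Overwrite a region with a flat color so it no longer distracts."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = fill
    return out


def highlight(img: Image, box: Box, tint: Pixel = (250, 250, 0), alpha: float = 0.5) -> Image:
    """Blend a tint into a region while keeping its content visible."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for y in range(top, bottom):
        for x in range(left, right):
            out[y][x] = tuple(
                round((1 - alpha) * c + alpha * t) for c, t in zip(out[y][x], tint)
            )
    return out


def draw_box(img: Image, box: Box, color: Pixel = (255, 0, 0)) -> Image:
    """Draw a one-pixel outline around a region, leaving its interior untouched."""
    left, top, right, bottom = box
    out = [row[:] for row in img]
    for x in range(left, right):
        out[top][x] = color
        out[bottom - 1][x] = color
    for y in range(top, bottom):
        out[y][left] = color
        out[y][right - 1] = color
    return out


# A hypothetical edit sequence, applied one step at a time: hide the left
# half, highlight the top-right block, then box the bottom-right block.
image = blank_image(8, 6)
edits: List[Tuple[Callable[..., Image], Box]] = [
    (mask_out, (0, 0, 4, 6)),
    (highlight, (4, 0, 8, 3)),
    (draw_box, (4, 3, 8, 6)),
]
edited = image
for tool, box in edits:
    edited = tool(edited, box)
```

Each tool returns a new image instead of modifying its input, so every intermediate edit stays available as a separate reasoning step. That matches the sequential refocusing the abstract describes, though whether the authors' tools behave this way is not stated in the text above.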
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
