Seeing is Improving: Visual Feedback for Iterative Text Layout Refinement

Junrong Guo; Shancheng Fang; Yadong Qu; Hongtao Xie

arXiv:2603.22187·cs.CV·March 24, 2026

Open Access

TL;DR

This paper introduces VFLM, a novel framework that uses visual feedback and reinforcement learning to iteratively improve text layout generation, resulting in more readable and aesthetically pleasing designs.

Contribution

The paper presents a self-improving, visually grounded layout model that uses visual feedback for iterative refinement, a novel approach in multimodal layout generation.

Findings

01 VFLM outperforms existing models on multiple benchmarks.

02 Visual feedback significantly improves layout quality.

03 Reinforcement learning with OCR-based rewards enhances refinement.
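The summary does not say how the OCR-based reward is computed. As a rough illustration only, one plausible form compares the text a layout was meant to show with the text an OCR engine reads back from the rendered image. The function name `ocr_reward`, the normalization, and the use of a character-level similarity ratio are all assumptions made for this sketch, not the paper's definition; the OCR step itself is left out and its output is passed in as a string.

```python
from difflib import SequenceMatcher


def ocr_reward(target_text: str, recognized_text: str) -> float:
    """Hypothetical reward in [0, 1]: how closely OCR output matches the intended text.

    This is an assumed formulation for illustration, not the reward used by VFLM.
    """
    def normalize(s: str) -> str:
        # Lowercase and collapse whitespace so trivial differences are not penalized.
        return " ".join(s.lower().split())

    a, b = normalize(target_text), normalize(recognized_text)
    if not a and not b:
        # Nothing was expected and nothing was read: treat as a perfect match.
        return 1.0
    # ratio() is 2 * matches / (len(a) + len(b)), so it is 1.0 only for identical strings.
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    print(ocr_reward("Summer Sale", "summer  sale"))  # fully legible text
    print(ocr_reward("Summer Sale", "Summer Sa"))     # text clipped by the canvas edge
    print(ocr_reward("Summer Sale", ""))              # text unreadable
```

A reward of this shape is low when rendered text is clipped, overlapped, or too small to read, which is the kind of failure a code-only model cannot observe.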

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm: they generate code that represents a layout, and a graphics engine then renders that code into the final image. However, these methods are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose the Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM performs adaptive reflective generation: it uses visual information to reflect on issues in previous outputs and iteratively generates new outputs until satisfactory quality is achieved. This is achieved through reinforcement learning with…
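The generate, render, reflect cycle described in the abstract can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `refine_layout`, the `Attempt` record, the `threshold` and `max_rounds` parameters, and the three toy callables are hypothetical names introduced here. In a real system the generator would be an MLLM, the renderer a graphics engine, and the assessor a model that inspects the rendered image.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Attempt:
    layout_code: str  # code produced in this round
    score: float      # quality score assigned after rendering


def refine_layout(
    prompt: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    render: Callable[[str], Any],
    assess: Callable[[Any], Tuple[float, str]],
    threshold: float = 0.9,
    max_rounds: int = 4,
) -> List[Attempt]:
    """Generate layout code, render it, and regenerate using feedback on the render."""
    history: List[Attempt] = []
    code: Optional[str] = None
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        code = generate(prompt, code, feedback)  # the first round has no feedback
        image = render(code)                     # the step a code-only model never sees
        score, feedback = assess(image)
        history.append(Attempt(code, score))
        if score >= threshold:                   # stop once quality is satisfactory
            break
    return history


# Toy stand-ins (all hypothetical): the "layout code" is just a font size.
def toy_generate(prompt: str, previous: Optional[str], feedback: Optional[str]) -> str:
    if previous is None:
        return "font_size=48"
    size = int(previous.split("=")[1])
    return f"font_size={size - 12}" if feedback == "overflow" else previous


def toy_render(code: str) -> int:
    # Pretend the rendered text width in pixels is ten times the font size.
    return int(code.split("=")[1]) * 10


def toy_assess(width: int) -> Tuple[float, str]:
    # The canvas is assumed to be 300 px wide; wider text overflows it.
    return (0.0, "overflow") if width > 300 else (1.0, "ok")


if __name__ == "__main__":
    for attempt in refine_layout("Summer Sale poster", toy_generate, toy_render, toy_assess):
        print(attempt)
```

With the toy stand-ins, the loop shrinks the font twice and stops on the third round, when the text fits the canvas. The number of rounds is not fixed in advance, which mirrors the adaptive behavior the abstract describes.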

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques