Visual Prompting with Iterative Refinement for Design Critique Generation

Peitong Duan; Chin-Yi Cheng; Bjoern Hartmann; Yang Li

arXiv:2412.16829·cs.AI·May 26, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces an iterative visual prompting method using large language models to generate detailed, visually grounded UI design critiques, improving quality and consistency over baseline approaches.

Contribution

It presents a novel iterative visual prompting approach that leverages LLMs for high-quality, region-specific UI critique generation, demonstrating improved performance and generalizability.

Findings

01

Human experts preferred critiques from our pipeline over baselines.

02

Our method reduced the gap from human performance by 50%.

03

The approach outperformed baselines in object and attribute detection tasks.

Abstract

Feedback is crucial for every design process, such as user interface (UI) design, and automating design critiques can significantly improve the efficiency of the design workflow. Although existing multimodal large language models (LLMs) excel in many tasks, they often struggle with generating high-quality design critiques -- a complex task that requires producing detailed design comments that are visually grounded in a given design's image. Building on recent advancements in iterative refinement of text output and visual prompting methods, we propose an iterative visual prompting approach for UI critique that takes an input UI screenshot and design guidelines and generates a list of design comments, along with corresponding bounding boxes that map each comment to a specific region in the screenshot. The entire process is driven completely by LLMs, which iteratively refine both the text…
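The output format and refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Critique`, `refine_critiques`, and `clamp_box` are hypothetical, and the refinement step is a deterministic stand-in (clamping boxes to the image bounds) for what the paper describes as an LLM-driven step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Critique:
    """One design comment tied to a region of the screenshot (hypothetical type)."""
    comment: str
    box: Box


def refine_critiques(
    critiques: List[Critique],
    refine_step: Callable[[Critique], Critique],
    max_rounds: int = 3,
) -> List[Critique]:
    """Apply refine_step repeatedly until nothing changes or max_rounds is hit."""
    current = list(critiques)
    for _ in range(max_rounds):
        updated = [refine_step(c) for c in current]
        if updated == current:  # converged: the refiner made no further changes
            break
        current = updated
    return current


def clamp_box(c: Critique) -> Critique:
    """Stand-in refiner (assumption): clamp the box to the image bounds.

    In the paper this role is played by an LLM; this placeholder only
    makes the loop runnable.
    """
    clamped = tuple(max(0.0, min(1.0, v)) for v in c.box)
    return Critique(c.comment, clamped)


if __name__ == "__main__":
    draft = [Critique("Low contrast on button", (-0.1, 0.2, 1.3, 0.9))]
    print(refine_critiques(draft, clamp_box))
```

The loop stops early once a round produces no change, so a refiner that converges quickly does not pay for all `max_rounds` iterations.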

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

**1. Novel Task Formulation** The paper clearly defines a spatially grounded critique generation task, integrating both text feedback and visual localization.

**2. Systematic Pipeline Design** Modular decomposition (generation, filtering, refinement) and VLM-based iterative feedback is conceptually clean and potentially extensible.

**3. Dataset Contribution** The UICrit dataset, with paired text–bbox annotations, could be useful for future multimodal critique or visual reasoning research.

Weaknesses

**1. Limited Empirical Depth** Experiments are restricted to a small set of baselines and models (Gemini-1.5-Pro, GPT-4o). The ablations are shallow; there is no analysis of failure cases, generalization across domains, or robustness.

**2. Marginal Quantitative Gains** Although IoU and similarity scores increase slightly with each module, absolute performance remains low (e.g., IoU < 0.36). Human evaluation improvements are modest and may not be statistically significant.

**3. Overly Complex
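For readers gauging the IoU figure quoted above, here is a minimal sketch of the metric, assuming it refers to the standard intersection-over-union between a predicted and a reference bounding box. The function name `iou` and the `(x_min, y_min, x_max, y_max)` box convention are assumptions; how the paper matches predicted boxes to references is not shown in this excerpt.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Under this definition, a score of 1.0 means the boxes coincide and 0.0 means they do not overlap at all.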

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- A VLM pipeline improves on a single VLM for UI design judgement.
- Both the visual and textual validation modules improve accuracy.
- Shows generalization to other visual grounding tasks.

Weaknesses

- The method and idea are simple, with limited novelty. No model training is done. No new dataset is involved.
- The method is not tailored much to the target problem, except for a few in-context samples used in prompts.
- The critique results are still far from human expert level.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The method is sound and coherent. Each step in the pipeline makes good sense, and ablation studies validate these gains. A human evaluation of the design critiques is performed. This is currently the best way to measure performance for functionality of this kind. The paper shows improved performance over baseline methods both qualitatively and quantitatively. The proposed method does not require fine-tuning and consequently can capitalize on improvements in frontier VLMs, which is

Weaknesses

While this is a solid piece of work, the gains are made by relatively obvious extensions of existing techniques. For example, extending iterative refinement [Madaan et al. (2023) and Xu et al. (2024a)] to bounding box prediction. It’s a little hard to tell how much the gains in results would really assist a UI designer. The quantitative gains appear relatively small, although it's hard to assess the scale of the numbers. The qualitative comparisons with the baseline method in the appendix w

Taxonomy

Topics: Data Visualization and Analytics