Visual Prompting with Iterative Refinement for Design Critique Generation
Peitong Duan, Chin-Yi Cheng, Bjoern Hartmann, Yang Li

TL;DR
This paper introduces an iterative visual prompting method using large language models to generate detailed, visually grounded UI design critiques, improving quality and consistency over baseline approaches.
Contribution
It presents a novel iterative visual prompting approach that leverages LLMs for high-quality, region-specific UI critique generation, demonstrating improved performance and generalizability.
Findings
Human experts preferred critiques from our pipeline over baselines.
Our method reduced the gap from human performance by 50%.
The approach outperformed baselines in object and attribute detection tasks.
Abstract
Feedback is crucial for every design process, such as user interface (UI) design, and automating design critiques can significantly improve the efficiency of the design workflow. Although existing multimodal large language models (LLMs) excel in many tasks, they often struggle with generating high-quality design critiques -- a complex task that requires producing detailed design comments that are visually grounded in a given design's image. Building on recent advancements in iterative refinement of text output and visual prompting methods, we propose an iterative visual prompting approach for UI critique that takes an input UI screenshot and design guidelines and generates a list of design comments, along with corresponding bounding boxes that map each comment to a specific region in the screenshot. The entire process is driven completely by LLMs, which iteratively refine both the text…
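The output format described in the abstract (a design comment paired with a bounding box, refined over several rounds) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. The `Critique` type, the normalized `(x_min, y_min, x_max, y_max)` box convention, the `refine` loop, its stopping rule, and the `nudge_toward` stub that stands in for an LLM call are all assumptions introduced here. Intersection-over-union (IoU) is the standard overlap measure for comparing two boxes.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Critique:
    """One design comment tied to a screenshot region (hypothetical type)."""
    comment: str
    box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A refiner maps a critique to a revised critique. In a real pipeline this
# would wrap a multimodal LLM call; here it is a plain callable.
Refiner = Callable[[Critique], Critique]


def refine(critique: Critique, refiner: Refiner,
           max_rounds: int = 3, stop_iou: float = 0.95) -> Critique:
    """Apply the refiner repeatedly until the output stops changing.

    The stopping rule (same comment, box overlap >= stop_iou) is an
    assumption made for this sketch.
    """
    current = critique
    for _ in range(max_rounds):
        proposed = refiner(current)
        converged = (proposed.comment == current.comment
                     and iou(proposed.box, current.box) >= stop_iou)
        current = proposed
        if converged:
            break
    return current


def nudge_toward(target: Box) -> Refiner:
    """Stub refiner for demonstration: moves the box halfway to a target."""
    def step(c: Critique) -> Critique:
        new_box = tuple((p + t) / 2 for p, t in zip(c.box, target))
        return replace(c, box=new_box)
    return step


if __name__ == "__main__":
    target = (0.2, 0.2, 0.6, 0.6)
    start = Critique("Button contrast is too low.", (0.0, 0.0, 1.0, 1.0))
    result = refine(start, nudge_toward(target), max_rounds=10)
    print(iou(start.box, target), iou(result.box, target))
```

Swapping the stub for a function that sends the screenshot, with the current box drawn on it, to a model is where visual prompting would enter. The loop itself does not depend on how the refiner is implemented.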
Peer Reviews
Decision·Submitted to ICLR 2026
**1. Novel Task Formulation** The paper clearly defines a spatially grounded critique generation task, integrating both text feedback and visual localization. **2. Systematic Pipeline Design** Modular decomposition (generation, filtering, refinement) and VLM-based iterative feedback is conceptually clean and potentially extensible. **3. Dataset Contribution** The UICrit dataset, with paired text–bbox annotations, could be useful for future multimodal critique or visual reasoning research.
**1. Limited Empirical Depth** Experiments are restricted to a small set of baselines and models (Gemini-1.5-Pro, GPT-4o). The ablations are shallow; there is no analysis of failure cases, generalization across domains, or robustness. **2. Marginal Quantitative Gains** Although IoU and similarity scores increase slightly with each module, absolute performance remains low (e.g., IoU < 0.36). Human evaluation improvements are modest and may not be statistically significant. **3. Overly Complex
- A VLM pipeline improves on a single VLM for UI design judgment.
- Both the visual and textual validation modules improve accuracy.
- Shows generalization to other visual grounding tasks.
- The method and idea are simple, with limited novelty. No model training is done. No new dataset is involved.
- The method is not tailored much to the target problem, except for a few in-context samples used in prompts.
- The critique results are still far from those of human experts.
The method is sound and coherent. Each step in the pipeline makes good sense, and ablation studies validate the gains. A human evaluation of the design critiques is performed. This is currently the best way to measure performance for functionality of this kind. The paper shows improved performance over baseline methods both qualitatively and quantitatively. The proposed method does not require fine-tuning and consequently can capitalize on improvements in frontier VLMs, which is
While this is a solid piece of work, the gains are made by relatively obvious extensions of existing techniques. For example, extending iterative refinement [Madaan et al. (2023) and Xu et al. (2024a)] to bounding box prediction. It’s a little hard to tell how much the gains in results would really assist a UI designer. The quantitative gains appear relatively small, although it's hard to assess the scale of the numbers. The qualitative comparisons with the baseline method in the appendix w
Taxonomy
Topics: Data Visualization and Analytics
