FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback
Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen

TL;DR
FronTalk introduces a new benchmark for conversational front-end code generation using multi-modal feedback, highlighting key challenges like forgetting and visual interpretation, and proposing a baseline method to address these issues.
Contribution
This paper presents FronTalk, a comprehensive benchmark with a novel evaluation framework and a baseline method to improve multi-turn, multi-modal front-end code generation.
Findings
Models suffer from significant forgetting of previous features.
Interpreting visual feedback remains a major challenge.
AceCoder reduces forgetting and improves performance by up to 9.3%.
Abstract
We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, and thus…
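To make the dialogue structure described in the abstract concrete, here is a minimal sketch of how one such record could be represented. The class names, field names, file path, and example instruction are hypothetical and are not taken from the FronTalk release; the only facts drawn from the abstract are that a dialogue has multiple turns and that each turn carries a textual instruction and an equivalent visual instruction for the same intent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    # Hypothetical field names: one user intent expressed in two modalities.
    text_instruction: str
    visual_instruction_path: str  # e.g. a sketch, mockup, or annotated screenshot


@dataclass
class Dialogue:
    # Hypothetical field names: the source website's domain and its ordered turns.
    domain: str
    turns: List[Turn] = field(default_factory=list)


def instruction_for(turn: Turn, modality: str) -> str:
    """Return the instruction for the chosen modality ("text" or "visual")."""
    if modality == "text":
        return turn.text_instruction
    if modality == "visual":
        return turn.visual_instruction_path
    raise ValueError(f"unknown modality: {modality!r}")


# Illustrative record; the instruction and path are invented for this sketch.
example = Dialogue(
    domain="news",
    turns=[Turn("Add a search bar to the header.", "turn_01_annotated.png")],
)
```

Keeping both modalities on the same turn object makes it straightforward to run the same dialogue once with textual feedback and once with visual feedback and compare the results.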
Peer Reviews
Decision·Submitted to ICLR 2026
- This work addresses incorporating multi-modal (text & image) feedback into multi-turn code generation, specifically targeting front-end development, which is underexplored in current benchmarks.
- The dataset contains 1,000 conversational turns across 100 dialogues and 3,676 manually refined test cases for robust model evaluation. The data is grounded in real-world websites from diverse domains, thereby increasing the benchmark's practicality and relevance.
- Evaluations are comprehensive and

- The dataset is generated using an LLM-based user simulator that generates context-aware instructions conditioned on prior dialogue. For evaluation, first-time users simulated by LLMs interact with each website, and a secondary LLM then compares the resulting trajectories and judges which interface is more usable. In this work, both dataset generation and evaluation rely heavily on LLMs, which can cause multiple problems: the evaluation is not reliable, the user simulator might not generate the

1. The proposed task of conversational frontend coding is both novel and realistic. It is potentially important for real-world human-machine collaborative web development.
2. It is an interesting and valuable finding that existing LLM agents often break functionality introduced in earlier turns.
3. The proposed method, AceCoder, is simple yet effective.

1. The evaluation does not penalize the agent for adding unrequested functions or layout. Such additions may reduce usability, but the score does not reliably reflect that.
2. The evaluation relies mainly on an automatic agent, which is itself a complex problem, so the approach needs rigorous justification that is missing. The authors report a human study with 82.0 accuracy and a Cohen's kappa of 62.7. It is unclear whether these are sufficient to show that the proposed metrics are a reliable p

- The paper presents a new benchmark focusing on multi-turn, multi-modal front-end development, accompanied by broad model evaluations and several meaningful analyses.
- The proposed AceCoder baseline effectively mitigates the forgetting issue.

- The main distinction from WebGen-Bench lies in the multi-turn setup. However, recent works such as WebGen-Agent [1] have already extended single-turn benchmarks into multi-turn scenarios, which suggests that this contribution may not be as technically challenging as implied.
- In Section 3.2, while Cohen's kappa is informative for evaluator reliability, it would be more meaningful to analyze intra-model (across different configurations) and inter-model (across models under the same configurat
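Two of the reviews above cite Cohen's kappa as the agreement statistic between the automatic evaluator and human judgment. As a reminder of what that number measures, the following is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The labels below are invented for illustration and are not data from the paper; the reported 82.0 and 62.7 are presumably on a 0 to 100 scale, which would correspond to 0.82 and 0.627 here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Invented example: a human judge and an automatic evaluator on ten test cases.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
agent = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]

kappa = cohens_kappa(human, agent)
print(f"kappa = {kappa:.3f}")  # observed agreement 0.80, chance agreement 0.52
```

In this toy example the raters agree on 80% of items, yet kappa is only about 0.58 because more than half of that agreement would be expected by chance. This is why raw accuracy and kappa are usually reported together when validating an automatic evaluator.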
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Data Visualization and Analytics
