TL;DR
This paper introduces InteractWeb-Bench, a benchmark for evaluating multimodal agents in website generation under realistic, ambiguous user conditions, highlighting current limitations in intent understanding and interaction.
Contribution
It presents the first multimodal interactive benchmark for website generation, pairing diverse simulated users with an environment for iterative refinement to cover the real-world, low-code development scenarios that existing benchmarks leave out.
Findings
MLLM-based agents often fail due to blind execution and poor intent recognition.
The benchmark simulates diverse user behaviors, including ambiguity and contradiction.
Current agents show limitations in adaptive interaction and requirement understanding.
Abstract
With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based, project-level code synthesis. Existing benchmarks rely on idealized assumptions, notably well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between the ambiguous, low-quality instructions of non-expert users and the model's understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert, low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations to systematically simulate diverse user behaviors,…
