FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow

Haoyu Sun; Huichen Will Wang; Jiawei Gu; Linjie Li; Yu Cheng

arXiv:2505.17399·cs.CL·May 27, 2025

TL;DR

FullFront introduces a comprehensive benchmark for evaluating Multimodal Large Language Models across the entire front-end web development process, highlighting current limitations and performance gaps compared to human experts.

Contribution

This work presents the first full pipeline benchmark for MLLMs in front-end engineering, including real-world webpage transformation and diverse task evaluation.

Findings

01

MLLMs show significant limitations in webpage perception and code generation.

02

Current models underperform human experts in front-end tasks.

03

Performance varies notably across different models and tasks.

Abstract

Front-end engineering involves a complex workflow where engineers conceptualize designs, translate them into code, and iteratively refine the implementation. While recent benchmarks primarily focus on converting visual designs to code, we present FullFront, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) across the full front-end development pipeline. FullFront assesses three fundamental tasks that map directly to the front-end engineering pipeline: Webpage Design (conceptualization phase), Webpage Perception QA (comprehension of visual organization and elements), and Webpage Code Generation (implementation phase). Unlike existing benchmarks that use either scraped websites with bloated code or oversimplified LLM-generated HTML, FullFront employs a novel, two-stage process to transform real-world webpages into clean, standardized HTML while maintaining…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- New resource for benchmarking MLLM capability on front-end engineering.
- Relatively thorough coverage of different capabilities and model families.
- Nice to include human evaluation too.

Weaknesses

- I'm not fully convinced by why we need an aggregate benchmark for front-end development. Some of the capabilities are quite distinct. For example, design (image) generation vs QA vs code generation. You have to use different models for the benchmarking because most models can't do image generation at all. Then what's the point of putting all of these tasks into one benchmark?
- I understand the authors put in effort to curate new data for many of the tasks in the benchmark. But I believe for

Reviewer 02 · Rating 6 · Confidence 4

Strengths

+ The benchmark mirrors real front-end workflows rather than a single slice: it covers conceptualization (Webpage Design), perception (Webpage Perception QA), and implementation (Webpage Code Generation) with concrete task counts across eight subtasks. This end-to-end framing is rare and useful for diagnosing capability gaps.
+ The dataset is grounded in real webpages and reconstructed into standardized, copyright-safe HTML through a two-stage, MLLM-assisted pipeline, addressing common issues of

Weaknesses

- Several construction and scoring steps rely on proprietary models (e.g., GPT-4o/Claude in the pipeline; Gemini-based visual scoring), which can introduce system bias and limit strict reproducibility; data/code release is contingent on acceptance.
- The engineered “Code Score” aggregates DOM and style attributes with fixed design choices; even with strong human correlation, such choices may privilege particular implementation patterns and under-reward acceptable alternatives.
- The results clea

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Compared to prior works, FullFront transforms real-world websites into clean, standardized HTML to avoid copyright issues.
2. Many models are comprehensively benchmarked on three subtasks.

Weaknesses

1. While the authors claim they unify multiple components into one cohesive evaluation pipeline, the implementation and results look like three separate benchmarks to me; the analysis of the connection between different parts seems weak.
2. Webpage Design seems to benchmark text-to-image generation capability. This part is kinda less motivated, since why do we want such MLLM to generate a website in image form? What makes it necessary to ask them to generate an image instead of generating code +

Code & Models

Repositories

mikivishy/fullfront (Official)
