WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics

Yuhong Dai; Yanlin Lai; Mitt Huang; Hangyu Guo; Dingming Li; Hongbo Peng; Haodong Li; Yingxiu Zhao; Haoran Lyu; Zheng Ge; Xiangyu Zhang; Daxin Jiang

arXiv:2603.13391 · cs.CV · March 17, 2026

TL;DR

WebVR is a benchmark that tests whether multimodal LLMs can recreate a webpage from a demonstration video. It targets the dynamic signals that text prompts and static screenshots miss, such as interaction flow, transition timing, and motion continuity, and scores outputs with a fine-grained, human-aligned visual rubric.

Contribution

This work presents the first dedicated benchmark for video-conditioned webpage generation, a human-aligned visual rubric for evaluating it, and a dataset of 175 diverse webpages built through a controlled synthesis pipeline rather than web crawling.

Findings

01. Models show significant gaps in reproducing fine-grained style and motion.

02. The rubric-based automatic evaluation reaches 96% agreement with human preferences.

03. The dataset and tools are publicly released for future research.
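
To make a figure like the 96% in finding 02 concrete, the sketch below shows one common way to measure how well an automatic score tracks human preferences: pairwise agreement. This page does not state how the paper computes alignment, so the protocol, the function name `pairwise_agreement`, and the sample scores are illustrative assumptions, not the authors' method or data.

```python
from typing import Iterable, Tuple


def pairwise_agreement(pairs: Iterable[Tuple[float, float, str]]) -> float:
    """Fraction of comparisons where the higher automatic score matches the human pick.

    Each item is (score_a, score_b, human_pick), where human_pick is "a" or "b".
    Ties in the automatic score count as disagreement (an assumption of this sketch).
    """
    total = 0
    agree = 0
    for score_a, score_b, human_pick in pairs:
        if human_pick not in ("a", "b"):
            raise ValueError(f"human_pick must be 'a' or 'b', got {human_pick!r}")
        total += 1
        if score_a == score_b:
            continue  # tie: the automatic score expresses no preference
        auto_pick = "a" if score_a > score_b else "b"
        if auto_pick == human_pick:
            agree += 1
    if total == 0:
        raise ValueError("no comparisons given")
    return agree / total


# Hypothetical rubric scores for four pairs of generated pages.
example = [
    (0.82, 0.61, "a"),  # automatic score and human both prefer a
    (0.40, 0.75, "b"),  # both prefer b
    (0.55, 0.50, "b"),  # automatic score prefers a, human prefers b
    (0.90, 0.30, "a"),  # both prefer a
]
print(pairwise_agreement(example))
```

Under this protocol, an agreement rate near 1.0 means the automatic score almost always ranks two candidate pages the same way a human annotator does.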

Abstract

Existing web-generation benchmarks rely on text prompts or static screenshots as input. However, videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, with no dedicated benchmark for this task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether MLLMs can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, ensuring varied and realistic demonstrations without overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that evaluates the generated webpages across multiple dimensions. Experiments on 19…

Code & Models

Datasets

BroAlanTaps/WebVR
dataset · 1.0k dl

Taxonomy

Topics: Video Analysis and Summarization · Interactive and Immersive Displays · Advanced Image and Video Retrieval Techniques