WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics
Yuhong Dai, Yanlin Lai, Mitt Huang, Hangyu Guo, Dingming Li, Hongbo Peng, Haodong Li, Yingxiu Zhao, Haoran Lyu, Zheng Ge, Xiangyu Zhang, Daxin Jiang

TL;DR
WebVR introduces a new benchmark for evaluating multimodal models on their ability to recreate webpages from videos, emphasizing the importance of dynamic visual signals and providing a comprehensive evaluation framework.
Contribution
This work presents the first dedicated benchmark and human-aligned evaluation rubric for video-conditioned webpage generation, along with a dataset of diverse, synthetically constructed webpages.
Findings
Models show significant gaps in fine-grained style and motion reproduction.
The rubric-based automatic evaluation reaches 96% agreement with human preferences.
The dataset and tools are publicly released for future research.
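This summary does not say how the 96% agreement figure is computed. One common way to measure alignment between an automatic scorer and human judges is pairwise preference agreement: for each pair of generated pages, check whether the page with the higher rubric score is also the one the human preferred. The sketch below illustrates that idea only. The function name, the record layout, and the example data are all hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# Each record: (rubric score for page A, rubric score for page B, human choice "A" or "B").
# Hypothetical layout for illustration; the paper's actual protocol may differ.
Pair = Tuple[float, float, str]


def pairwise_agreement(pairs: List[Pair]) -> float:
    """Return the fraction of pairs where the higher rubric score matches the human choice."""
    if not pairs:
        raise ValueError("need at least one pair")
    agree = 0
    for score_a, score_b, human_choice in pairs:
        if score_a == score_b:
            continue  # a tie is counted as a disagreement in this sketch
        rubric_choice = "A" if score_a > score_b else "B"
        if rubric_choice == human_choice:
            agree += 1
    return agree / len(pairs)


# Made-up example data, not results from the paper.
example_pairs = [
    (0.9, 0.4, "A"),  # rubric prefers A, human prefers A: agree
    (0.3, 0.7, "B"),  # rubric prefers B, human prefers B: agree
    (0.6, 0.5, "B"),  # rubric prefers A, human prefers B: disagree
    (0.8, 0.2, "A"),  # rubric prefers A, human prefers A: agree
]
agreement = pairwise_agreement(example_pairs)
print(f"{agreement:.0%}")
```

Counting ties as disagreements is a conservative choice made here for simplicity; excluding them from the denominator is an equally plausible convention.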
Abstract
Existing web-generation benchmarks rely on text prompts or static screenshots as input. However, videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, with no dedicated benchmark for this task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether MLLMs can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, ensuring varied and realistic demonstrations without overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that evaluates the generated webpages across multiple dimensions. Experiments on 19…
Taxonomy
Topics: Video Analysis and Summarization · Interactive and Immersive Displays · Advanced Image and Video Retrieval Techniques
