The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking
Diego Cabezas Palacios

TL;DR
This study evaluates the performance and social reach of different large language model families in generating HTML content over eight weeks, using human and automated assessments across multiple social media platforms.
Contribution
It provides a longitudinal, public-interface evaluation of first-output LLM web generation, comparing multiple models and analyzing social media impact and quality metrics.
Findings
Claude was the most consistent and strongest model family.
Longer reasoning time did not improve output quality.
Gemini's judge was more lenient than human evaluators.
Abstract
This paper presents an eight-week observational comparison of 68 single-file HTML generations collected across 17 public experiments in the "HTML AI Battle" project between December 10, 2025 and February 4, 2026. Four reasoning model families, GPT, Gemini, Grok, and Claude, were compared under a fixed public-interface protocol with no custom instructions, no personality tuning, and no repair prompts. Each output was evaluated from a rendered browser video using human scores and a Gemini LLM-as-a-judge layer for prompt adherence, functional correctness, and UI quality, then packaged into a standardized social-media protocol spanning X (Twitter), TikTok, and YouTube. The tracker was also used for two supervised predictive analyses: an experiment-level model for 24-hour X impressions and a generation-level model for HTML verbosity. Under this protocol, Claude was the strongest and most…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
