AnyDoc: Enhancing Document Generation via Large-Scale HTML/CSS Data Synthesis and Height-Aware Reinforcement Optimization
Jiawei Lin, Wanrong Zhu, Vlad I Morariu, Christopher Tensmeyer

TL;DR
AnyDoc introduces DocHTML, a large-scale HTML/CSS dataset, and a height-aware reinforcement learning method for multi-task document generation; it outperforms baseline models on all tasks.
Contribution
The paper presents a scalable data synthesis pipeline, DocHTML (a large multi-category dataset), and a height-aware reinforcement learning approach for document generation.
Findings
AnyDoc outperforms baseline models on all tasks.
The DocHTML dataset contains 265,206 document samples covering 111 categories and 32 styles.
Height-aware reinforcement learning reduces content overflow.
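To make the overflow finding concrete, here is a minimal sketch of what a height-based reward term could look like. It is not the paper's reward: the function name `height_reward`, the pixel units, and the linear decay are all assumptions chosen for illustration, since this summary does not give the actual formulation.

```python
def height_reward(rendered_height_px: float, page_height_px: float) -> float:
    """Hypothetical reward: 1.0 when the content fits, decaying linearly with overflow.

    Both arguments are assumptions for this sketch: the rendered height of the
    generated HTML/CSS and the height of the target page, in pixels.
    """
    if page_height_px <= 0:
        raise ValueError("page_height_px must be positive")
    # Content that fits inside the page has zero overflow.
    overflow = max(0.0, rendered_height_px - page_height_px)
    # Overflow is expressed as a fraction of the page height; one full page
    # of overflow (or more) yields a reward of 0.0.
    return max(0.0, 1.0 - overflow / page_height_px)
```

In a reinforcement learning loop, a term like this would be combined with other quality signals, so that a policy is penalized for documents whose rendered content runs past the page boundary.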
Abstract
Document generation has gained growing attention in the field of AI-driven content creation. In this work, we push its boundaries by introducing AnyDoc, a framework capable of handling multiple generation tasks across a wide spectrum of document categories, all represented in a unified HTML/CSS format. To overcome the limited coverage and scale of existing human-crafted document datasets, AnyDoc first establishes a scalable data synthesis pipeline to automatically generate documents in HTML/CSS form. This pipeline yields DocHTML, a large-scale dataset containing 265,206 document samples, while spanning 111 categories and 32 distinct styles. Additionally, all documents are equipped with comprehensive metadata, including design intentions, HTML/CSS source code, visual assets, and rendered screenshots. Building on the curated dataset, AnyDoc fine-tunes multi-modal large language models…
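The metadata listed in the abstract (design intentions, HTML/CSS source code, visual assets, rendered screenshots) suggests a per-sample record along the following lines. This is only a sketch: the class name `DocSample`, every field name, and the `is_complete` check are hypothetical and do not reflect the actual DocHTML schema.

```python
from dataclasses import dataclass, field


@dataclass
class DocSample:
    """Hypothetical record for one synthesized document (field names are assumptions)."""

    category: str            # one of the document categories
    style: str               # one of the visual styles
    design_intention: str    # natural-language description of the design intent
    html_css: str            # HTML/CSS source code of the document
    asset_paths: list[str] = field(default_factory=list)  # visual assets used by the page
    screenshot_path: str = ""                              # rendered screenshot

    def is_complete(self) -> bool:
        # A sample is treated as complete when its intention, source, and
        # screenshot are all present; assets are optional in this sketch.
        return bool(self.design_intention and self.html_css and self.screenshot_path)
```

A record like this could be serialized one per line (for example as JSON) so that fine-tuning tasks can select whichever fields they need as input and target.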
Taxonomy
Topics: Web Data Mining and Analysis · Generative Adversarial Networks and Image Synthesis · Software Engineering Research
