HTLM: Hyper-Text Pre-Training and Prompting of Language Models
Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer

TL;DR
HTLM is a novel hyper-text language model trained on large-scale web data, leveraging HTML structure for improved zero-shot and fine-tuned performance across various NLP tasks, especially in summarization and classification.
Contribution
This work introduces HTLM, a hyper-text language model trained on HTML data, utilizing structured prompts and HTML semantics to enhance transfer learning and zero-shot capabilities.
Findings
HTLM matches or exceeds comparably sized text-only models on classification benchmarks, in both zero-shot prompting and fine-tuning.
HTLM achieves state-of-the-art results in zero-shot summarization.
Hyper-text prompts improve data efficiency over plain text prompts.
Abstract
We introduce HTLM, a hyper-text language model trained on a large-scale web crawl. Modeling hyper-text has a number of advantages: (1) it is easily gathered at scale, (2) it provides rich document-level and end-task-adjacent supervision (e.g. class and id attributes often encode document category information), and (3) it allows for new structured prompting that follows the established semantics of HTML (e.g. to do zero-shot summarization by infilling title tags for a webpage that contains the input text). We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for…
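The title-infilling prompt mentioned in the abstract can be illustrated with a minimal sketch. It only builds the prompt string and does not call any model. The function name `build_title_infill_prompt`, the `<mask>` placeholder, and the exact HTML layout are assumptions made for illustration, not the authors' prompt format.

```python
from html import escape

# Assumed placeholder; the actual mask token depends on the model's tokenizer.
MASK = "<mask>"


def build_title_infill_prompt(article_text: str, mask_token: str = MASK) -> str:
    """Wrap plain text in minimal HTML with a masked <title> element.

    A model trained to infill masked HTML spans would be asked to generate
    the title, which serves as a zero-shot summary of the body text.
    """
    # Treat blank-line-separated chunks as paragraphs and escape HTML characters.
    paragraphs = [p.strip() for p in article_text.split("\n\n") if p.strip()]
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        "<html>\n"
        f"<head><title>{mask_token}</title></head>\n"
        f"<body>\n{body}\n</body>\n"
        "</html>"
    )


if __name__ == "__main__":
    print(build_title_infill_prompt("First paragraph.\n\nSecond paragraph."))
```

The string returned here would be passed to an infilling model, and the text generated for the masked title would be read off as the summary.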
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
