How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs

Yapei Chang; Kyle Lo; Mohit Iyyer; Luca Soldaini

arXiv:2602.08808 · cs.LG · February 10, 2026

Open Access

TL;DR

This paper introduces How2Everything, a scalable framework that mines web data for procedural tasks, creates an evaluation benchmark, and uses LLM-based scoring to improve goal-conditioned procedure generation in large language models.

Contribution

The paper presents an end-to-end pipeline spanning web data mining (How2Mine), benchmark creation (How2Bench), LLM-based evaluation (How2Score), and reinforcement learning to improve procedure generation in LLMs.

Findings

01 How2Mine extracted 351K procedures from 980K web pages.

02 How2Score achieved 80.5% agreement with human judgments.

03 RL with How2Score improved model performance by over 10 points.
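
As a rough illustration of what an agreement figure like the one above measures, the sketch below computes simple percent agreement between binary judge verdicts and human labels. The source does not say how agreement was computed, so plain percent agreement is an assumption here, and `agreement_rate` and its inputs are hypothetical names, not part of How2Score.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items on which the judge and the human gave the same label.

    Assumption: labels are binary (True = contains a critical failure).
    This is plain percent agreement, not a chance-corrected statistic.
    """
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("no labels to compare")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy data, invented for illustration only.
judge = [True, False, True, True]
human = [True, False, False, True]
rate = agreement_rate(judge, human)
print(f"agreement: {rate:.1%}")
```

Percent agreement is easy to read but does not correct for chance; a chance-corrected measure would be a natural alternative if the label distribution is skewed.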

Abstract

Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation,…
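
Two ideas in the abstract lend themselves to a small sketch: drawing a topic-balanced evaluation set from a larger pool, and turning per-generation "critical failure" verdicts into a single score. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The record layout (`topic`, `goal`), the helper names, and the aggregation into a failure-free rate are all hypothetical; the judge itself is not modeled, only its boolean verdicts. If a 7K-example set were split evenly over 14 topics it would hold about 500 examples per topic, but that even split is an inference, not a figure stated in the source.

```python
import random
from collections import defaultdict


def balanced_sample(procedures, per_topic, seed=0):
    """Draw up to `per_topic` procedures from each topic (hypothetical helper)."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for proc in procedures:
        by_topic[proc["topic"]].append(proc)
    sample = []
    for topic in sorted(by_topic):  # sorted for a reproducible order
        items = by_topic[topic]
        k = min(per_topic, len(items))  # small topics contribute what they have
        sample.extend(rng.sample(items, k))
    return sample


def failure_free_rate(has_critical_failure):
    """Fraction of generations the judge did NOT flag with a critical failure.

    `has_critical_failure` is a list of booleans, one per generation
    (assumed to come from an LLM judge, which is not modeled here).
    """
    if not has_critical_failure:
        raise ValueError("no verdicts to score")
    ok = sum(1 for flagged in has_critical_failure if not flagged)
    return ok / len(has_critical_failure)


# Toy pool, invented for illustration only.
pool = [{"topic": t, "goal": f"{t}-task-{i}"} for t in ("cooking", "repair") for i in range(5)]
subset = balanced_sample(pool, per_topic=2)
score = failure_free_rate([True, False, False, False])
print(len(subset), score)
```

Using a binary verdict per generation keeps the aggregate easy to interpret: the score is simply the share of outputs that a judge considers free of goal-blocking errors.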

Taxonomy

Topics: AI in Service Interactions · Topic Modeling · Text Readability and Simplification