Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping

Guan-Lun Huang; Yuh-Jzer Joung

arXiv:2603.29161·cs.AI·April 1, 2026

TL;DR

Webscraper is a framework that uses a multimodal large language model to autonomously navigate dynamic, interactive websites and extract structured data from them, reporting higher extraction accuracy than baseline methods.

Contribution

The paper presents a multimodal LLM-based approach that combines a structured five-stage prompting procedure with custom-built tools to extract content from modern websites that follow the index-and-content architecture.

Findings

01. Achieves higher extraction accuracy than baseline methods.

02. Successfully applied to news and e-commerce websites.

03. Demonstrates robustness on dynamic, interactive web pages.

Abstract

Modern web scraping struggles with dynamic, interactive websites that require more than static HTML parsing. Current methods are often brittle and require manual customization for each site. To address this, we introduce Webscraper, a framework designed to handle the challenges of modern, dynamic web applications. It leverages a Multimodal Large Language Model (MLLM) to autonomously navigate interactive interfaces, invoke specialized tools, and perform structured data extraction in environments where traditional scrapers are ineffective. Webscraper utilizes a structured five-stage prompting procedure and a set of custom-built tools to navigate and extract data from websites following the common "index-and-content" architecture. Our experiments, conducted on six news websites, demonstrate that the full Webscraper framework, equipped with both our guiding prompt and specialized tools,…
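To make the "index-and-content" pattern concrete, here is a minimal Python sketch of the traversal it implies: collect links from an index page, then visit each linked content page and extract a structured record. This is not the paper's method. It uses plain static parsing with the standard library and no MLLM, prompting stages, or tools. The in-memory `PAGES` site, the class names, and the `scrape` function are all hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site": one index page linking to two content pages.
PAGES = {
    "/index": '<ul><li><a href="/a1">First</a></li>'
              '<li><a href="/a2">Second</a></li></ul>',
    "/a1": "<h1>First headline</h1><p>Body of the first article.</p>",
    "/a2": "<h1>Second headline</h1><p>Body of the second article.</p>",
}


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on an index page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class ArticleExtractor(HTMLParser):
    """Gather <h1> text as the title and <p> text as body paragraphs."""

    def __init__(self):
        super().__init__()
        self.current = None      # tag currently being read, if any
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self.current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current == "h1":
            self.title += data
        elif self.current == "p":
            self.paragraphs[-1] += data


def scrape(index_url, fetch=PAGES.__getitem__):
    """Walk index -> content pages and return one record per content page."""
    collector = LinkCollector()
    collector.feed(fetch(index_url))

    records = []
    for url in collector.links:
        extractor = ArticleExtractor()
        extractor.feed(fetch(url))
        records.append({
            "url": url,
            "title": extractor.title.strip(),
            "body": " ".join(p.strip() for p in extractor.paragraphs),
        })
    return records


if __name__ == "__main__":
    for record in scrape("/index"):
        print(record)
```

A static walker like this only works when links and article text are present in the served HTML. The abstract's point is that dynamic, interactive sites break that assumption, which is the gap the MLLM-driven navigation is meant to fill.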
