AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation

Wenhao Huang; Zhouhong Gu; Chenghao Peng; Zhixu Li; Jiaqing Liang; Yanghua Xiao; Liqian Wen; Zulong Chen

arXiv:2404.12753 · cs.CL · September 27, 2024 · 1 citation

TL;DR

AutoScraper is a novel framework that uses large language models to generate adaptable web scrapers by leveraging HTML structure and page similarity, improving efficiency and reusability across diverse websites.

Contribution

The paper introduces a two-stage, LLM-based framework for web scraper generation that handles diverse web environments more effectively than existing methods.

Findings

01. AutoScraper outperforms existing methods in adaptability and efficiency.

02. Leveraging the hierarchical structure of HTML improves scraper accuracy.

03. The new executability metric better evaluates scraper performance.

Abstract

Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Among existing methods, wrapper-based methods suffer from limited adaptability and scalability when faced with a new website, while language agents, empowered by large language models (LLMs), exhibit poor reusability in diverse web environments. In this work, we introduce the paradigm of generating web scrapers with LLMs and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently. AutoScraper leverages the hierarchical structure of HTML and the similarity across different web pages to generate web scrapers. In addition, we propose a new executability metric for better measuring the performance of web scraper generation tasks. We conduct comprehensive…
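The reusability concern raised in the abstract can be illustrated with a toy sketch. It is not the paper's method or its executability metric, which the truncated abstract does not define. It assumes a scraper is a single XPath expression, uses three hypothetical well-formed pages that share a template, and scores the scraper with an illustrative `reuse_rate`: the fraction of pages on which the same expression returns the expected value.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages sharing one template (invented for illustration).
# They are well-formed so the standard-library XML parser can read them.
PAGES = [
    "<html><body><div class='item'><h1>Laptop</h1>"
    "<span class='price'>999</span></div></body></html>",
    "<html><body><div class='item'><h1>Phone</h1>"
    "<span class='price'>499</span></div></body></html>",
    # This page has an extra block before the item, shifting positions.
    "<html><body><div class='ad'>Sale</div><div class='item'><h1>Tablet</h1>"
    "<span class='price'>299</span></div></body></html>",
]
EXPECTED = ["999", "499", "299"]


def run_scraper(xpath, page):
    """Apply one XPath 'scraper' to a page; return the text or None."""
    root = ET.fromstring(page)
    node = root.find(xpath)
    return node.text if node is not None else None


def reuse_rate(xpath, pages, expected):
    """Illustrative stand-in metric: share of pages scraped correctly."""
    hits = sum(run_scraper(xpath, p) == e for p, e in zip(pages, expected))
    return hits / len(pages)


# A position-based path tied to one page's layout.
brittle = "./body/div[1]/span"
# A path that relies on a structural cue shared across pages.
robust = ".//span[@class='price']"

print(reuse_rate(brittle, PAGES, EXPECTED))
print(reuse_rate(robust, PAGES, EXPECTED))
```

In this toy setup the position-based path fails on the third page, where an extra block shifts the layout, while the attribute-based path works on all three. That gap is the kind of cross-page reuse the abstract says a generated scraper should achieve.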

Taxonomy

Topics: Web Data Mining and Analysis · Advanced Malware Detection Techniques