The Synergy of Automated Pipelines with Prompt Engineering and Generative AI in Web Crawling
Chau-Jian Huang

TL;DR
This paper explores how integrating generative AI tools like Claude AI and ChatGPT-4.0 with prompt engineering can automate and improve web crawling, addressing webpage diversity and anti-scraping challenges.
Contribution
It introduces a novel approach combining AI and prompt engineering to automate web scraping, with empirical evaluation showing Claude AI's superior performance.
Findings
Claude AI outperformed ChatGPT-4.0 in script quality and adaptability
Incorporating anti-scraping tools improved script robustness
Visualizations confirmed Claude AI's higher effectiveness
Abstract
Web crawling is a critical technique for extracting online data, yet it poses challenges due to webpage diversity and anti-scraping mechanisms. This study investigates the integration of two generative AI tools, Claude AI (Sonnet 3.5) and ChatGPT-4.0, with prompt engineering to automate web scraping. Using two prompts, PROMPT I (general inference, tested on Yahoo News) and PROMPT II (element-specific, tested on Coupons.com), we evaluate the code quality and performance of AI-generated scripts. Claude AI consistently outperformed ChatGPT-4.0 in script quality and adaptability, as confirmed by predefined evaluation metrics, including functionality, readability, modularity, and robustness. Performance data were collected through manual testing and structured scoring by three evaluators. Visualizations further illustrate Claude AI's superiority. Anti-scraping solutions, including…
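The abstract mentions structured scoring by three evaluators across four metrics (functionality, readability, modularity, robustness) but does not say how the scores are combined. As a minimal sketch, assuming a simple unweighted average and a 1 to 5 scale (both assumptions, not details from the paper), the aggregation for one generated script could look like this. The function name `aggregate_scores` and the ratings are hypothetical placeholders, not the paper's data.

```python
from statistics import mean

# Metric names taken from the abstract; the 1-5 scale is an assumption.
METRICS = ("functionality", "readability", "modularity", "robustness")


def aggregate_scores(ratings):
    """Average each metric across evaluators, then average the metric means.

    `ratings` is a list with one dict per evaluator, mapping metric -> score.
    Unweighted averaging is an assumed scheme, not the paper's stated method.
    """
    per_metric = {m: mean(r[m] for r in ratings) for m in METRICS}
    overall = mean(per_metric.values())
    return per_metric, overall


# Placeholder ratings from three hypothetical evaluators (NOT results from the paper).
example_ratings = [
    {"functionality": 4, "readability": 5, "modularity": 3, "robustness": 4},
    {"functionality": 5, "readability": 4, "modularity": 3, "robustness": 4},
    {"functionality": 3, "readability": 3, "modularity": 3, "robustness": 4},
]

per_metric, overall = aggregate_scores(example_ratings)
print(per_metric, overall)
```

Running the same aggregation for each tool and each prompt would give comparable per-metric and overall scores; a weighted scheme could be substituted if some metrics matter more than others.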
Taxonomy
Topics: Web Data Mining and Analysis · Distributed and Parallel Computing Systems
