Prompt engineering for bibliographic web-scraping
Manuel Blázquez-Ochando, Juan José Prieto-Gutiérrez, María Antonia Ovalle-Perandones

TL;DR
This paper demonstrates how prompt engineering with ChatGPT-4o can efficiently generate fully functional web-scrapers for bibliographic catalogues, minimizing interaction and improving data extraction quality.
Contribution
It introduces a method to use prompt engineering with large language models to automatically develop web-scrapers for bibliographic data extraction.
Findings
Effective model for AI-assisted web-scraper development
Improved scraping quality through context-aware prompts
Minimal interaction needed for functional scraper generation
Abstract
Bibliographic catalogues store millions of records. Computational techniques such as web-scraping allow this data to be extracted efficiently and accurately. The recent emergence of ChatGPT makes it easier to develop prompts that configure a scraper to identify and extract information from databases. The aim of this article is to define how to use prompt engineering efficiently to build a suitable data entry model that can generate, in a single interaction with ChatGPT-4o, a fully functional web-scraper, programmed in PHP and adapted to bibliographic catalogues. As a demonstration, the bibliographic catalogue of the National Library of Spain is used, with a dataset of thousands of records. The findings present an effective model for developing web-scraping programs, assisted with AI and with the minimum possible…
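The abstract does not reproduce the prompt itself, so the following is only an illustrative sketch of what a single-interaction prompt might look like when assembled from structured context. The `ScraperSpec` class, the `build_prompt` function, the field names, the selector hints, and the example URL are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class ScraperSpec:
    """Hypothetical container for the context a scraper-generation prompt needs."""
    catalogue_name: str
    record_url_pattern: str  # placeholder pattern, not a real endpoint
    fields: dict[str, str] = field(default_factory=dict)  # field name -> selector hint
    target_language: str = "PHP"  # the abstract states the scraper is written in PHP
    output_format: str = "CSV"    # assumed output format, for illustration only


def build_prompt(spec: ScraperSpec) -> str:
    """Assemble one self-contained prompt so that no follow-up turns are needed."""
    field_lines = "\n".join(
        f"- {name}: located at {hint}" for name, hint in spec.fields.items()
    )
    return (
        f"Write a complete, runnable {spec.target_language} web-scraper "
        f"for the {spec.catalogue_name}.\n"
        f"Record pages follow this URL pattern: {spec.record_url_pattern}\n"
        f"Extract these fields from each record:\n{field_lines}\n"
        f"Write the results to a {spec.output_format} file, one record per row.\n"
        "Return only the code, with no explanation."
    )


if __name__ == "__main__":
    spec = ScraperSpec(
        catalogue_name="example bibliographic catalogue",
        record_url_pattern="https://catalogue.example/record/{id}",
        fields={"title": "h1.title", "author": "span.author"},
    )
    print(build_prompt(spec))
```

Putting the URL pattern, the field list, and the output format into one message is one way to reduce back-and-forth with the model; the context the authors actually supply may differ.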
Taxonomy
Topics: Information Retrieval and Search Behavior · Web Data Mining and Analysis · Research Data Management Practices
