Evaluation of LLM-based Strategies for the Extraction of Food Product Information from Online Shops

Christoph Brosch; Sian Brumm; Rolf Krieger; Jonas Scheffler

arXiv:2506.21585 · cs.CL · July 8, 2025

TL;DR

This paper evaluates two LLM-based methods for extracting structured food product data from online shops: direct extraction and indirect extraction via generated functions. The indirect method gives up a small amount of accuracy in exchange for far fewer LLM calls, which matters for scalable web data extraction.

Contribution

It introduces and compares direct and indirect LLM-based extraction methods, demonstrating the efficiency and cost benefits of the indirect approach for web data extraction.

Findings

01

Indirect extraction reduces LLM calls by 95.82%.

02

The indirect approach reaches 96.48% accuracy, 1.61% below direct extraction.

03

Both methods were evaluated on a curated dataset of 3,000 food product pages from three online shops.
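To put the reported percentages in absolute terms, the short calculation below works through them. It is a back-of-the-envelope sketch, not a result from the paper: it assumes the 1.61% accuracy gap is meant in percentage points, and it assumes the direct approach issues exactly one LLM call per page. Neither assumption is confirmed by this summary, and all variable names are illustrative.

```python
# Figures reported in the paper's abstract.
indirect_accuracy = 96.48   # percent
accuracy_gap = 1.61         # ASSUMPTION: percentage points, not a relative change
call_reduction = 95.82      # percent fewer LLM calls for the indirect approach
pages = 3000                # size of the curated dataset

# Implied accuracy of direct extraction under the percentage-point assumption.
direct_accuracy = round(indirect_accuracy + accuracy_gap, 2)

# ASSUMPTION: direct extraction needs one LLM call per product page.
direct_calls = pages
indirect_calls = round(direct_calls * (1 - call_reduction / 100), 1)

print(f"direct accuracy (implied): {direct_accuracy}%")
print(f"LLM calls: direct={direct_calls}, indirect~{indirect_calls}")
```

Under these assumptions the indirect approach would need on the order of 125 calls for the whole dataset rather than 3,000.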

Abstract

Generative AI and large language models (LLMs) offer significant potential for automating the extraction of structured information from web pages. In this work, we focus on food product pages from online retailers and explore schema-constrained extraction approaches to retrieve key product attributes, such as ingredient lists and nutrition tables. We compare two LLM-based approaches, direct extraction and indirect extraction via generated functions, evaluating them in terms of accuracy, efficiency, and cost on a curated dataset of 3,000 food product pages from three different online shops. Our results show that although the indirect approach achieves slightly lower accuracy (96.48%, -1.61% compared to direct extraction), it reduces the number of required LLM calls by 95.82%, leading to substantial efficiency gains and lower operational costs. These findings suggest that indirect…
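The difference between the two approaches can be illustrated with a minimal sketch. It is not the authors' pipeline: `CountingLLM` is a hypothetical stand-in that only counts calls, `parse_ingredients` is a placeholder for whatever the model (or the function it generates) would actually do, and the `<div id="ingredients">` markup is invented for the example. The sketch also simplifies the indirect approach to a single generation call, which is not a figure reported in the paper.

```python
import re
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


def parse_ingredients(html: str) -> Record:
    """Placeholder extraction logic (hypothetical page markup)."""
    match = re.search(r'<div id="ingredients">(.*?)</div>', html, re.S)
    return {"ingredients": match.group(1).strip() if match else None}


class CountingLLM:
    """Hypothetical stand-in for an LLM client; it only counts calls."""

    def __init__(self) -> None:
        self.calls = 0

    def extract(self, html: str) -> Record:
        # Direct extraction: the model reads every page itself.
        self.calls += 1
        return parse_ingredients(html)

    def generate_extractor(self, sample_html: str) -> Callable[[str], Record]:
        # Indirect extraction: the model is asked once for a reusable function.
        self.calls += 1
        return parse_ingredients


def direct_extraction(llm: CountingLLM, pages: List[str]) -> List[Record]:
    return [llm.extract(page) for page in pages]


def indirect_extraction(llm: CountingLLM, pages: List[str]) -> List[Record]:
    extractor = llm.generate_extractor(pages[0])
    # The generated function runs locally, with no further model calls.
    return [extractor(page) for page in pages]


pages = [
    f'<html><div id="ingredients"> water, salt #{i} </div></html>'
    for i in range(5)
]
direct_llm = CountingLLM()
indirect_llm = CountingLLM()
direct_records = direct_extraction(direct_llm, pages)
indirect_records = indirect_extraction(indirect_llm, pages)
print(direct_llm.calls, indirect_llm.calls)  # prints: 5 1
```

In the sketch both paths return identical records because they share the same placeholder parser. In practice a generated function can fail on pages whose layout differs from the sample, which is one plausible source of the accuracy gap the abstract reports.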
