Method for Aggregating Unstructured Data Using Large Language Models
Vsevolod Lazebnyi, Natalia Tereshkina, Maria Shabarina, Dmitriy Fedorov

TL;DR
This paper introduces a robust method combining web scraping, LLMs, and verification techniques to automate and improve the aggregation of unstructured web data, addressing the instability of existing techniques and the manual effort they require.
Contribution
It presents a novel two-stage verification process for LLM-generated data, enhancing accuracy and robustness in dynamic web content aggregation.
Findings
High accuracy in key field completion
Robustness to webpage structure changes
Scalable for real-time news and log analysis
Abstract
This paper presents a method for the automated collection and aggregation of unstructured data from diverse web sources, utilizing Large Language Models (LLMs). The primary challenges with existing techniques are their instability when the structure of webpages changes, their limited support for dynamically loaded content during information collection, and the need for labor-intensive manual design of data pre-processing pipelines. The proposed algorithm integrates hybrid web scraping (Goose3 for static pages and Selenium+WebDriver for dynamic ones), data storage in the non-relational MongoDB database management system (DBMS), and intelligent extraction and normalization of information into a predetermined JSON schema using LLMs. A key scientific contribution of this study is a two-stage verification process for the generated data, designed to eliminate potential hallucinations by…
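The normalization step mentioned in the abstract, where LLM output is mapped into a predetermined JSON schema, can be illustrated with a minimal sketch. This is not the authors' implementation, and it does not reproduce their two-stage verification process, whose details are not given here. The schema fields (`title`, `published_at`, `source_url`, `summary`) and the function name `normalize_record` are hypothetical, chosen only for illustration; the sketch uses the Python standard library only.

```python
import json

# Hypothetical target schema (an assumption, not taken from the paper):
# field name -> (expected Python type, whether the field is required)
SCHEMA = {
    "title": (str, True),
    "published_at": (str, True),
    "source_url": (str, True),
    "summary": (str, False),
}


def normalize_record(raw_llm_output, schema=SCHEMA):
    """Parse raw LLM text and coerce it into the fixed schema.

    Returns (record, problems). Keys outside the schema are dropped,
    and fields that are missing or have the wrong type are set to None.
    """
    try:
        data = json.loads(raw_llm_output)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not a JSON object"]

    record = {}
    problems = []
    for field, (expected_type, required) in schema.items():
        value = data.get(field)
        if value is None:
            if required:
                problems.append(f"missing required field: {field}")
            record[field] = None
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type for field: {field}")
            record[field] = None
        else:
            # Trim surrounding whitespace on string values.
            record[field] = value.strip() if isinstance(value, str) else value
    return record, problems
```

A record that comes back with a non-empty `problems` list could be sent back to the LLM for another attempt or flagged for review instead of being stored; which of these the paper does is not stated in this excerpt.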
