Method for Aggregating Unstructured Data Using Large Language Models

Vsevolod Lazebnyi; Natalia Tereshkina; Maria Shabarina; Dmitriy Fedorov

arXiv:2604.16425 · cs.DB · April 21, 2026

TL;DR

This paper introduces a method that combines web scraping, LLMs, and verification techniques to automate the aggregation of unstructured web data, addressing the instability and manual effort of existing approaches.

Contribution

It presents a novel two-stage verification process for LLM-generated data, enhancing accuracy and robustness in dynamic web content aggregation.

Findings

01 High accuracy in key field completion
02 Robustness to webpage structure changes
03 Scalable for real-time news and log analysis

Abstract

This paper presents a method for the automated collection and aggregation of unstructured data from diverse web sources using Large Language Models (LLMs). Existing techniques have three main weaknesses: they become unstable when the structure of webpages changes, they offer limited support for dynamically loaded content, and they require labor-intensive manual design of data pre-processing pipelines. The proposed algorithm integrates hybrid web scraping (Goose3 for static pages and Selenium+WebDriver for dynamic ones), data storage in MongoDB, a non-relational database management system (DBMS), and intelligent LLM-based extraction and normalization of information into a predetermined JSON schema. A key scientific contribution of this study is a two-stage verification process for the generated data, designed to eliminate potential hallucinations by…
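
The hybrid scraping and schema normalization steps described in the abstract could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the schema fields, the `MIN_CHARS` threshold, and the function names `fetch_text` and `parse_record` are all hypothetical, and the rule "fall back to the dynamic scraper when the static one returns too little text" is an assumed heuristic. The two fetchers are passed in as plain callables; in a real pipeline they would presumably wrap Goose3 and Selenium. The check shown is a simple shape check on the LLM output, not the paper's two-stage verification, whose details are cut off in the abstract above.

```python
import json
from typing import Callable

# Hypothetical target schema: field name -> expected Python type.
SCHEMA = {"title": str, "published": str, "body": str}

# Assumed threshold below which static extraction is treated as a failure.
MIN_CHARS = 200


def fetch_text(url: str,
               static_fetch: Callable[[str], str],
               dynamic_fetch: Callable[[str], str],
               min_chars: int = MIN_CHARS) -> str:
    """Try the cheap static extractor first, then the browser-based one."""
    text = static_fetch(url)
    if len(text.strip()) >= min_chars:
        return text
    # Too little text: the page is probably rendered client-side.
    return dynamic_fetch(url)


def parse_record(llm_output: str, schema: dict = SCHEMA) -> dict:
    """Parse LLM output as JSON and keep only well-typed schema fields.

    Raises ValueError if the output is not a JSON object or if a schema
    field is missing or has the wrong type.
    """
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    record = {}
    for field, expected in schema.items():
        value = data.get(field)
        if not isinstance(value, expected):
            raise ValueError(
                f"field {field!r} missing or not {expected.__name__}")
        record[field] = value
    return record
```

Fields the LLM returns beyond the schema are dropped, so the record handed to storage always has exactly the predetermined shape. Rejecting malformed output outright, rather than patching it, leaves any retry or verification policy to the caller.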
