Vidya: An AI-Driven Modular Pipeline for Archival Automation and Semantic Metadata Enrichment

Cloter Migliorini Filho; Julia Graciela Machado; Edson Armando Silva; Marcella Scoczynski

arXiv:2605.16338·cs.DL·May 19, 2026

TL;DR

Vidya is a modular AI pipeline that automates metadata enrichment for large-scale digital archives using LLMs and open-source tools, reducing processing time from decades to days.

Contribution

It introduces a structured, deterministic approach to semantic enrichment of archives with LLMs, constrained by ontologies and validation, enabling scalable, low-cost deployment.

Findings

01 Reduced processing time from decades to days

02 Achieved cost-effective, scalable archival enrichment

03 Ensured compliance with archival standards

Abstract

The large-scale digitization of historical archives has created a paradox: "dark data", digital objects that lack the metadata needed for retrieval. Manual archival description is slow and expensive, limiting discovery and reuse. We propose Vidya, a modular pipeline that orchestrates Large Language Models (LLMs) and FOSS tools to automate semantic enrichment and archival ingestion at scale. Vidya constrains generations using YAML-defined ontologies and Pydantic validation, producing deterministic, structured JSON outputs from probabilistic models. Developed at the Laboratory for Digital Humanities and Innovation (LAMUHDI) of the State University of Ponta Grossa (UEPG), Vidya applies Maker principles and open-source practices to enable low-cost deployment in memory institutions using modest hardware. We compare LLM performance and present a cost-benefit analysis showing major gains, reducing processing…
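The abstract's idea of constraining LLM output with an ontology and Pydantic validation can be illustrated with a minimal sketch. This is not Vidya's actual code: the `ONTOLOGY` contents, the `ArchivalRecord` schema, and the `parse_llm_output` helper are all hypothetical names chosen for illustration, and the ontology is inlined as a Python dict rather than loaded from YAML to keep the example short.

```python
import json
from enum import Enum
from typing import List

from pydantic import BaseModel, ValidationError

# Hypothetical ontology. The abstract says Vidya defines ontologies in YAML;
# a plain dict stands in for the parsed YAML file here.
ONTOLOGY = {"document_types": ["letter", "photograph", "newspaper"]}

# Turn the ontology into a closed vocabulary: any value outside it
# fails validation instead of slipping into the archive.
DocumentType = Enum(
    "DocumentType", {name: name for name in ONTOLOGY["document_types"]}
)


class ArchivalRecord(BaseModel):
    """Hypothetical target schema for one enriched archival item."""

    title: str
    document_type: DocumentType
    subjects: List[str]


def parse_llm_output(raw: str):
    """Return a validated record, or None if the raw text breaks the schema."""
    try:
        return ArchivalRecord(**json.loads(raw))
    except (json.JSONDecodeError, TypeError, ValidationError):
        # Malformed JSON, a non-object payload, a missing field, or an
        # off-ontology value: the caller can retry or flag for review.
        return None


if __name__ == "__main__":
    ok = parse_llm_output(
        '{"title": "Carta", "document_type": "letter", "subjects": ["family"]}'
    )
    bad = parse_llm_output(
        '{"title": "Poema", "document_type": "poem", "subjects": []}'
    )
    print(ok is not None, bad is None)
```

Under these assumptions, the model's text is only accepted when it parses as JSON, carries every required field, and uses a document type from the ontology. Whatever passes has a fixed, predictable structure, which is one plausible reading of how structured outputs can be obtained from a probabilistic model.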
