Vidya: An AI-Driven Modular Pipeline for Archival Automation and Semantic Metadata Enrichment
Cloter Migliorini Filho, Julia Graciela Machado, Edson Armando Silva, Marcella Scoczynski

TL;DR
Vidya is a modular AI pipeline that automates metadata enrichment for large-scale digital archives using LLMs and open-source tools, significantly reducing processing time.
Contribution
It introduces a structured, deterministic approach to semantic enrichment of archives with LLMs, constrained by ontologies and validation, enabling scalable, low-cost deployment.
Findings
Reduced processing time from decades to days
Achieved cost-effective, scalable archival enrichment
Ensured compliance with archival standards
Abstract
The large-scale digitization of historical archives has created a paradox: "dark data," digital objects that lack the metadata needed for retrieval. Manual archival description is slow and expensive, which limits discovery and reuse. We propose Vidya, a modular pipeline that orchestrates Large Language Models (LLMs) and FOSS tools to automate semantic enrichment and archival ingestion at scale. Vidya constrains generations using YAML-defined ontologies and Pydantic validation, producing deterministic, structured JSON outputs from probabilistic models. Developed at the Laboratory for Digital Humanities and Innovation (LAMUHDI) of the State University of Ponta Grossa (UEPG), Vidya applies Maker principles and open-source practices to enable low-cost deployment in memory institutions using modest hardware. We compare LLM performance and present a cost-benefit analysis showing major gains, reducing processing…
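To illustrate the idea of constraining LLM output with an ontology and Pydantic validation, here is a minimal sketch. It is not Vidya's actual code: the `ONTOLOGY` dictionary, the `ArchivalRecord` schema, its field names, and the `parse_llm_output` helper are all hypothetical, and the controlled vocabulary is written inline rather than loaded from a YAML file as the abstract describes.

```python
import json
from enum import Enum
from typing import List

from pydantic import BaseModel, ValidationError

# Hypothetical controlled vocabulary. In a real deployment this would
# presumably be loaded from a YAML ontology file (e.g. with yaml.safe_load).
ONTOLOGY = {"document_type": ["letter", "photograph", "newspaper", "report"]}

# Build an Enum from the ontology so that only listed terms pass validation.
DocumentType = Enum(
    "DocumentType", {term.upper(): term for term in ONTOLOGY["document_type"]}
)


class ArchivalRecord(BaseModel):
    """Illustrative schema; the field names are assumptions, not Vidya's."""

    title: str
    document_type: DocumentType
    subjects: List[str]


def parse_llm_output(raw: str):
    """Return a validated record, or None if the output violates the schema."""
    try:
        return ArchivalRecord(**json.loads(raw))
    except (json.JSONDecodeError, ValidationError, TypeError):
        # Malformed JSON, a non-object payload, or an off-ontology value.
        return None


# Simulated model outputs (assumed examples, not real data).
good = '{"title": "Letter to the mayor", "document_type": "letter", "subjects": ["politics"]}'
bad = '{"title": "Untitled", "document_type": "poem", "subjects": []}'

record = parse_llm_output(good)    # accepted: "letter" is in the ontology
rejected = parse_llm_output(bad)   # rejected: "poem" is not in the ontology
```

In a sketch like this, a rejected output could be retried or routed to human review, so that only records conforming to the schema reach ingestion; how Vidya itself handles rejections is not stated in the abstract.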
