LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources
Joshua Castillo, Ravi Mukkamala

TL;DR
This paper presents the Guardian Parser Pack, an AI-driven system that transforms diverse investigative documents into a unified schema for improved missing-person case analysis, combining deterministic and LLM-assisted methods.
Contribution
The paper introduces a schema-guided extraction pipeline that integrates multi-engine parsing, rule-based source identification, and LLM-assisted extraction with validation, with the aim of improving data quality in investigations.
Findings
LLM-assisted extraction substantially outperforms deterministic methods in extraction quality (F1 = 0.8664 vs. 0.2578).
The system improves key-field completeness from 93.23% to 96.97%.
Deterministic parsing is roughly 130 times faster (0.03 s/record) than LLM-based extraction (3.95 s/record).
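The trade-off in these findings can be made concrete with a little arithmetic. The snippet below only restates the reported numbers; the variable names are illustrative and do not come from the paper.

```python
# Reported figures from the findings above (variable names are illustrative).
det_f1, llm_f1 = 0.2578, 0.8664                       # extraction quality (F1)
det_sec, llm_sec = 0.03, 3.95                         # runtime in seconds per record
completeness_before, completeness_after = 93.23, 96.97  # key-field completeness, percent

f1_gain = llm_f1 - det_f1                 # absolute F1 improvement
slowdown = llm_sec / det_sec              # how many times slower the LLM path is
completeness_gain = completeness_after - completeness_before  # percentage points

print(f"F1 gain: {f1_gain:.4f}")                           # F1 gain: 0.6086
print(f"LLM slowdown: {slowdown:.1f}x")                    # LLM slowdown: 131.7x
print(f"Completeness gain: {completeness_gain:.2f} points")  # Completeness gain: 3.74 points
```

In other words, the LLM pathway buys about 0.61 F1 at a cost of roughly two orders of magnitude in per-record latency.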
Abstract
Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway incorporating validator-guided…
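The staged design described in the abstract can be illustrated with a minimal sketch. This is not the Guardian Parser Pack's code: the schema fields, the source signatures, the function names, and the `llm_extract` callback are all hypothetical. How the validator guides the LLM pathway is cut off in the abstract, so the retry loop below is an assumption. PDF text extraction and OCR are omitted, and the input is taken to be already-extracted text.

```python
import re

# Hypothetical required fields of the unified schema (not from the paper).
REQUIRED_FIELDS = ("name", "age", "last_seen_date")

# Hypothetical rule-based signatures for identifying the document source.
SOURCE_RULES = {
    "structured_form": re.compile(r"^Case Number:", re.MULTILINE),
    "bulletin_poster": re.compile(r"\bMISSING\b"),
}


def identify_source(text):
    """Return the first source type whose signature matches the text."""
    for source, pattern in SOURCE_RULES.items():
        if pattern.search(text):
            return source
    return "narrative_profile"


def parse_key_value(text):
    """Deterministic parser: read 'Key: value' lines into a flat record."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            record[key.strip().lower().replace(" ", "_")] = value.strip()
    return record


def validate(record):
    """Check a record against the assumed schema; return a list of errors."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    age = record.get("age")
    if age is not None and not str(age).isdigit():
        errors.append("age must be an integer")
    date = record.get("last_seen_date")
    if date is not None and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(date)):
        errors.append("last_seen_date must be YYYY-MM-DD")
    return errors


def extract(text, llm_extract=None, max_retries=2):
    """Try the deterministic parser first; if validation fails and an LLM
    callback is supplied, retry with the validator errors as feedback
    (assumed behavior, not confirmed by the abstract)."""
    source = identify_source(text)
    record = parse_key_value(text)
    errors = validate(record)
    attempts = 0
    while errors and llm_extract is not None and attempts < max_retries:
        record = llm_extract(text, errors)
        errors = validate(record)
        attempts += 1
    return {"source": source, "record": record, "errors": errors}


def fake_llm(text, errors):
    """Stand-in for a real LLM call, used only to exercise the fallback path."""
    return {"name": "Jane Doe", "age": "14", "last_seen_date": "2024-01-05"}


form_text = "Case Number: 123\nName: Jane Doe\nAge: 14\nLast Seen Date: 2024-01-05"
narrative_text = "Jane Doe, 14, was last seen on January 5."

form_result = extract(form_text)
narrative_plain = extract(narrative_text)
narrative_llm = extract(narrative_text, llm_extract=fake_llm)
```

In this sketch the structured form validates on the deterministic path alone, while the narrative text yields no key-value pairs and passes validation only once the optional callback is supplied. This mirrors, in a simplified way, why an optional LLM pathway can be kept off the fast path and invoked only when needed.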
