A Reliability Evaluation of Hybrid Deterministic-LLM Based Approaches for Academic Course Registration PDF Information Extraction
Muhammad Anis Al Hilmi, Neelansh Khare, Noel Framil Iglesias

TL;DR
This paper evaluates hybrid deterministic-LLM methods for extracting information from academic registration PDFs, demonstrating high accuracy and efficiency, especially with a Camelot pipeline and Qwen 2.5 models, in resource-limited settings.
Contribution
It introduces and empirically assesses a hybrid approach combining deterministic methods and LLMs for reliable, efficient PDF information extraction in academic contexts.
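The idea of trying a deterministic rule first and calling an LLM only when the rule fails can be sketched as follows. This is a minimal illustration, not the authors' implementation: the field labels in `FIELD_PATTERNS`, the function `extract_fields`, and the `llm_fallback` callable are all hypothetical names, and real KRS layouts may use different labels.

```python
import re
from typing import Callable, Dict

# Hypothetical metadata patterns; the actual labels in KRS documents are not
# given in this summary and are assumed here for illustration only.
FIELD_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "student_id": re.compile(r"Student ID\s*:\s*(\d+)"),
    "name": re.compile(r"Name\s*:\s*(.+)"),
}


def extract_fields(text: str, llm_fallback: Callable[[str, str], str]) -> Dict[str, str]:
    """Extract each field with a regex; defer to the LLM only on a miss."""
    result: Dict[str, str] = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Deterministic path: no model call is needed for this field.
            result[field] = match.group(1).strip()
        else:
            # Fallback path: llm_fallback stands in for a call to a local model.
            result[field] = llm_fallback(field, text)
    return result
```

Because the model is only invoked for fields the patterns miss, the number of LLM calls shrinks as regex coverage grows, which is one plausible source of the efficiency gain described above.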
Findings
Hybrid approach improves efficiency over LLM-only methods.
The Camelot pipeline with LLM fallback achieves up to 99-100% accuracy.
The Qwen 2.5:14b model performs consistently across scenarios.
Abstract
This study evaluates the reliability of approaches for extracting information from KRS (academic course registration) documents using three strategies: LLM-only, hybrid deterministic-LLM (regex + LLM), and a Camelot-based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM-based tests and 860 documents for the Camelot-based pipeline evaluation, covering four study programs whose tables and metadata vary. Three LLMs in the 12-14B parameter range (Gemma 3, Phi 4, and Qwen 2.5) were run locally with Ollama on a consumer-grade CPU without a GPU. Evaluation used exact match (EM) and Levenshtein similarity (LS) metrics, with a threshold of 0.7. Although this did not hold for every model, the results show that the hybrid approach can improve efficiency over the LLM-only approach, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM…
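The EM and LS metrics named in the abstract can be illustrated with a short sketch. Two points are assumptions, since the summary does not state them: LS is normalized here as 1 minus the edit distance divided by the longer string's length, and the 0.7 threshold is read as "a prediction passes when LS is at least 0.7". The function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def exact_match(prediction: str, reference: str) -> bool:
    """EM after trimming surrounding whitespace (trimming is an assumption)."""
    return prediction.strip() == reference.strip()


def passes_threshold(prediction: str, reference: str, threshold: float = 0.7) -> bool:
    """Assumed reading of the 0.7 threshold: LS >= threshold counts as a pass."""
    return levenshtein_similarity(prediction, reference) >= threshold
```

Under this normalization a single-character slip in a nine-character field still passes the 0.7 threshold while failing EM, which is why the two metrics are worth reporting side by side.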
