Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions
Hussein Ghaly

TL;DR
This paper presents an ensemble approach that combines multiple GPT models with two new evaluation metrics to improve the accuracy of semantic tagging of UN Security Council resolutions while keeping the process reliable and cost-efficient.
Contribution
It introduces a novel ensemble methodology with evaluation metrics for selecting optimal LLM outputs in semantic tagging tasks.
Findings
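GPT-4.1 achieved the highest scores on both tasks: a CPR of 84.9% for cleaning, and a CPR of 99.99% with a TWF of 99.92% for semantic tagging.
Smaller models such as GPT-4.1-mini performed comparably to the best model in each task at only 20% of the cost.
The metrics let the ensemble system select the best output from multiple runs of several GPT models.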
Abstract
This paper introduces a new methodology for using LLM-based systems for accurate and efficient semantic tagging of UN Security Council resolutions. The main goal is to leverage LLM performance variability to build ensemble systems for data cleaning and semantic tagging tasks. We introduce two evaluation metrics, Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), to guard against hallucinations and against additions to or omissions from the input text beyond what the task requires. These metrics allow the selection of the best output from multiple runs of several GPT models. GPT-4.1 achieved the highest metrics for both tasks (cleaning: CPR 84.9%; semantic tagging: CPR 99.99% and TWF 99.92%). In terms of cost, smaller models, such as GPT-4.1-mini, achieved comparable performance to the best model in each task at only 20% of the cost. These metrics ultimately allowed the…
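The abstract names the two metrics but does not give their formulas, so the following is only a minimal sketch of how metric-based selection among candidate outputs might look. Everything in it is an assumption made for illustration: CPR is approximated as the token-level similarity between the input and the de-tagged output, TWF as the share of tags that are correctly paired, candidates are ranked by the product of the two, and the `<op>` tag is a hypothetical name, not the paper's tag schema.

```python
import re
from difflib import SequenceMatcher

# Matches XML-style tags such as <op>, </op>, or <br/> (tag names are hypothetical).
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")


def strip_tags(text):
    """Remove tags so only the underlying text content remains."""
    return TAG_RE.sub("", text)


def content_preservation_ratio(source, output):
    """Assumed CPR: token-level similarity between the input and the de-tagged
    output. Both additions and omissions lower the score."""
    src_tokens = source.split()
    out_tokens = strip_tags(output).split()
    if not src_tokens and not out_tokens:
        return 1.0
    matcher = SequenceMatcher(None, src_tokens, out_tokens, autojunk=False)
    return matcher.ratio()


def tag_well_formedness(output):
    """Assumed TWF: fraction of tags that are self-closing or belong to a
    correctly nested open/close pair."""
    stack = []
    total = 0
    good = 0
    for match in TAG_RE.finditer(output):
        closing, name, self_closing = match.groups()
        total += 1
        if self_closing:
            good += 1
        elif not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
            good += 2  # the opening tag and its matching close
        # A mismatched close, or an open tag never closed, adds to total only.
    return good / total if total else 1.0


def select_best(source, candidates):
    """Pick the candidate with the highest CPR * TWF (an assumed ranking rule)."""
    return max(
        candidates,
        key=lambda c: content_preservation_ratio(source, c) * tag_well_formedness(c),
    )


source = "The Security Council decides to remain seized of the matter."
candidates = [
    # Drops a word and leaves a tag unclosed.
    "The Council <op>decides to remain seized of the matter.",
    # Preserves the text and closes its tag.
    "The Security Council <op>decides</op> to remain seized of the matter.",
]
best = select_best(source, candidates)
```

Here `best` is the second candidate, since the first both omits a word and leaves its tag unclosed. In practice the candidates would come from repeated runs of several models, and the ranking rule could equally apply thresholds to each metric rather than a product.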
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
