THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models
Mengfei Liang, Archish Arun, Zekun Wu, Cristian Munoz, Jonathan Lutch, Emre Kazim, Adriano Koshiyama, Philip Treleaven

TL;DR
THaMES is an integrated framework that automates the evaluation and mitigation of hallucinations in large language models, improving their factual accuracy across various tasks with adaptable strategies.
Contribution
It introduces an end-to-end, standardized pipeline for hallucination detection and mitigation, combining automated test set generation, benchmarking, and multiple mitigation techniques.
Findings
Commercial models like GPT-4o benefit more from Retrieval-Augmented Generation (RAG).
Open-weight models like Llama-3.1-8B-Instruct gain more from In-Context Learning (ICL).
Parameter-Efficient Fine-Tuning (PEFT) improves Llama-3.1-8B-Instruct's performance.
Abstract
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models (LLMs). Existing detection and mitigation methods are often isolated and insufficient for domain-specific needs, lacking a standardized pipeline. This paper introduces THaMES (Tool for Hallucination Mitigations and EvaluationS), an integrated framework and library addressing this gap. THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs, featuring automated test set generation, multifaceted benchmarking, and adaptable mitigation strategies. It automates test set creation from any corpus, ensuring high data quality, diversity, and cost-efficiency through techniques like batch processing, weighted sampling, and counterfactual validation. THaMES assesses a model's ability to detect and reduce hallucinations across various tasks, including text…
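The abstract names weighted sampling as one of the techniques used to keep generated test sets diverse, but it does not describe the weighting scheme. The sketch below is only an illustration under an assumed scheme: corpus chunks that have already been drawn more often get a lower weight. The function name `sample_chunks`, the `usage_counts` dictionary, and the inverse-count weights are hypothetical and are not taken from THaMES.

```python
import random


def sample_chunks(chunks, usage_counts, k, seed=0):
    """Draw k chunk indices (with replacement), favouring less-used chunks.

    Assumption: weight = 1 / (1 + times already sampled). The paper's
    actual weighting scheme is not stated in the abstract.
    """
    rng = random.Random(seed)
    indices = range(len(chunks))
    # Chunks never sampled before get weight 1.0; the weight shrinks with use.
    weights = [1.0 / (1 + usage_counts.get(i, 0)) for i in indices]
    picked = rng.choices(indices, weights=weights, k=k)
    # Record the draws so later calls down-weight these chunks.
    for i in picked:
        usage_counts[i] = usage_counts.get(i, 0) + 1
    return picked


counts = {}
batch = sample_chunks(["chunk A", "chunk B", "chunk C"], counts, k=5)
```

Calling `sample_chunks` repeatedly with the same `counts` dictionary gradually shifts draws toward chunks that have appeared less often, which is one simple way to spread question generation across a corpus.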
Taxonomy
Topics: Mental Health Research Topics · Pharmacovigilance and Adverse Drug Reactions · Epilepsy research and treatment
Methods: Sparse Evolutionary Training · Attention Dropout · WordPiece · Dense Connections · Residual Connection · Linear Layer · Multi-Head Attention · Linear Warmup With Linear Decay · Adam
