AXCEL: Automated eXplainable Consistency Evaluation using LLMs
P Aditya Sreekar, Sahil Verma, Suransh Chopra, Sarik Ghazarian, Abhishek Persad, Narayanan Sadagopan

TL;DR
AXCEL introduces an explainable, generalizable LLM-based consistency evaluation metric that outperforms existing methods across multiple NLP tasks by providing detailed reasoning and pinpointing inconsistencies.
Contribution
This work presents AXCEL, a novel prompt-based consistency metric that offers explanations and is adaptable to various tasks without prompt modifications.
Findings
AXCEL outperforms SOTA metrics in summarization, text generation, and data-to-text tasks.
AXCEL provides detailed explanations for consistency scores.
AXCEL performs well with open-source LLMs.
Abstract
Large Language Models (LLMs) are widely used in both industry and academia for various tasks, yet evaluating the consistency of generated text responses continues to be a challenge. Traditional metrics like ROUGE and BLEU show a weak correlation with human judgment. More sophisticated metrics using Natural Language Inference (NLI) have shown improved correlations but are complex to implement, require domain-specific training due to poor cross-domain generalization, and lack explainability. More recently, prompt-based metrics using LLMs as evaluators have emerged; while they are easier to implement, they still lack explainability and depend on task-specific prompts, which limits their generalizability. This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL), a prompt-based consistency metric which offers explanations for the consistency scores by providing…
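The abstract's remark that overlap metrics such as ROUGE and BLEU correlate weakly with human judgment can be illustrated with a toy example. The sketch below is not ROUGE itself and is not part of AXCEL; it is a simplified unigram-overlap F1 (the function name `unigram_f1` and the example sentences are assumptions made up for illustration). It shows how a candidate that contradicts the source on a single word can still score higher than a faithful paraphrase.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified unigram-overlap F1 (a rough stand-in for ROUGE-1, not the real metric)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, invented for this illustration.
source = "the company reported a profit of 5 million dollars in 2020"
consistent = "the company made 5 million dollars in profit in 2020"
inconsistent = "the company reported a loss of 5 million dollars in 2020"

score_consistent = unigram_f1(source, consistent)
score_inconsistent = unigram_f1(source, inconsistent)

print(f"consistent paraphrase:   {score_consistent:.3f}")
print(f"inconsistent near-copy:  {score_inconsistent:.3f}")
```

In this toy case the factually wrong sentence gets the higher score, because it copies almost every token of the source. Surface overlap measures shared wording rather than agreement of facts, which is the gap that NLI-based and prompt-based consistency metrics aim to close.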
Taxonomy
Topics: Topic Modeling
