Evaluating Consistency and Reasoning Capabilities of Large Language Models
Yash Saxena, Sarthak Chopra, Arunendra Mani Tripathi

TL;DR
This paper evaluates the consistency and reasoning abilities of several large language models on the BoolQ dataset. Proprietary models outperform public ones, but none reaches high accuracy on both measures.
Contribution
The study provides a comprehensive comparison of public and proprietary LLMs' consistency and reasoning capabilities using multiple evaluation metrics and datasets.
Findings
Proprietary models outperform public models in both consistency and reasoning.
No model reached 90% accuracy on both consistency and reasoning.
There is a strong correlation between reasoning and consistency in LLMs.
Abstract
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance, for tasks such as text generation, summarization, and translation. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This behavior can be attributed to several factors, with consistency and reasoning capabilities being significant contributors. LLMs frequently lack the ability to generate explanations and engage in coherent reasoning, leading to inaccurate responses. Moreover, they exhibit inconsistencies in their outputs. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs. The experiments utilize the BoolQ dataset as the ground truth, comprising questions, answers, and corresponding explanations.…
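The abstract is truncated here, so the paper's exact metrics are not shown. As a rough illustration only, one plausible way to score consistency on boolean questions is to ask each question several times and check whether the answers agree. The function names, the all-answers-agree definition of consistency, and the sample data below are assumptions made for this sketch, not the authors' method.

```python
from collections import Counter


def consistency_rate(answers_per_question):
    """Fraction of questions whose repeated answers are all identical.

    Assumed definition for illustration; the paper may define consistency differently.
    """
    if not answers_per_question:
        return 0.0
    consistent = sum(1 for answers in answers_per_question if len(set(answers)) == 1)
    return consistent / len(answers_per_question)


def majority_accuracy(answers_per_question, gold):
    """Accuracy of the majority-vote answer per question against gold labels."""
    if len(answers_per_question) != len(gold):
        raise ValueError("need one gold label per question")
    if not gold:
        return 0.0
    correct = 0
    for answers, truth in zip(answers_per_question, gold):
        # most_common(1) returns the most frequent answer; ties fall to the
        # answer seen first, so an odd number of repetitions avoids ambiguity.
        majority, _ = Counter(answers).most_common(1)[0]
        correct += int(majority == truth)
    return correct / len(gold)


# Hypothetical data: four yes/no questions, each asked three times.
runs = [
    [True, True, True],
    [True, False, True],
    [False, False, False],
    [False, True, True],
]
gold = [True, True, True, True]

consistency = consistency_rate(runs)
accuracy = majority_accuracy(runs, gold)
```

Keeping the two scores separate makes it possible to see a model that is consistent but wrong (the third question above) as distinct from one that is right on average but unstable.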
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Weight Decay · Linear Layer · Adam · Linear Warmup With Linear Decay · Layer Normalization · Multi-Head Attention · Dropout · Attention Dropout
