Evaluating Consistency and Reasoning Capabilities of Large Language Models

Yash Saxena; Sarthak Chopra; Arunendra Mani Tripathi

arXiv:2404.16478 · cs.CL · April 26, 2024 · 1 citation

Open Access

TL;DR

This paper evaluates the consistency and reasoning abilities of public and proprietary large language models on the BoolQ dataset. Proprietary models outperform public ones, but no model reaches 90% accuracy on both measures.

Contribution

The study provides a comprehensive comparison of public and proprietary LLMs' consistency and reasoning capabilities using multiple evaluation metrics and datasets.

Findings

01

Proprietary models outperform public models in both consistency and reasoning.

02

None of the models achieved 90% accuracy in both metrics.

03

There is a strong correlation between reasoning and consistency in LLMs.

Abstract

Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance, for tasks such as text generation, summarization, and translation. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This behavior can be attributed to several factors, with consistency and reasoning capabilities being significant contributors. LLMs frequently lack the ability to generate explanations and engage in coherent reasoning, leading to inaccurate responses. Moreover, they exhibit inconsistencies in their outputs. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs. The experiments utilize the BoolQ dataset as the ground truth, comprising questions, answers, and corresponding explanations.…
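The abstract is cut off before it describes how consistency is scored, so the following is only a minimal sketch of one common way to measure it, not the authors' procedure. It assumes each yes/no question is posed to a model several times, takes consistency to be the share of repeated answers that agree with the most frequent one, and scores the majority answer against a ground-truth label. The function names and sample data are hypothetical.

```python
from collections import Counter


def consistency(answers):
    """Share of repeated answers that match the most frequent answer.

    Assumed definition for illustration; the paper may define it differently.
    """
    if not answers:
        raise ValueError("need at least one answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def majority_answer(answers):
    """Most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions, gold):
    """Fraction of predictions that equal the ground-truth labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one label")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical repeated yes/no answers from one model, keyed by question id.
runs = {
    "q1": [True, True, True, False],
    "q2": [False, False, False, False],
}
# Hypothetical ground-truth labels in the style of a boolean QA dataset.
gold = {"q1": True, "q2": False}

question_ids = sorted(runs)
mean_consistency = sum(consistency(runs[q]) for q in question_ids) / len(question_ids)
model_accuracy = accuracy(
    [majority_answer(runs[q]) for q in question_ids],
    [gold[q] for q in question_ids],
)
```

Reporting the two numbers separately keeps a model that is reliably wrong (high consistency, low accuracy) distinguishable from one that is right only some of the time.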

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Weight Decay · Linear Layer · Adam · Linear Warmup With Linear Decay · Layer Normalization · Multi-Head Attention · Dropout · Attention Dropout