A Statistical Analysis of LLMs' Self-Evaluation Using Proverbs

Ryosuke Sonoda; Ramya Srinivasan

arXiv:2410.16640·cs.CL·October 23, 2024

Open Access

TL;DR

This paper evaluates LLMs' self-assessment abilities using a new proverb reasoning dataset, revealing limitations in their reasoning and cultural understanding, as well as biases such as gender stereotypes.

Contribution

Introduces a novel proverb dataset and evaluation method to analyze LLMs' self-evaluation, highlighting their reasoning failures and cultural biases.

Findings

01 LLMs show inconsistencies in proverb reasoning tasks.

02 The dataset uncovers gender stereotypes in LLM responses.

03 The proposed method effectively identifies weaknesses in LLMs' reasoning and cultural understanding.

Abstract

Large language models (LLMs) such as ChatGPT, GPT-4, Claude-3, and Llama are being integrated across a variety of industries. Despite this rapid proliferation, experts are calling for caution in the interpretation and adoption of LLMs, owing to numerous associated ethical concerns. Research has also uncovered shortcomings in LLMs' reasoning and logical abilities, raising questions about the potential of LLMs as evaluation tools. In this paper, we investigate LLMs' self-evaluation capabilities on a novel proverb reasoning task. We introduce a proverb database consisting of 300 proverb pairs that are similar in intent but different in wording, across topics spanning gender, wisdom, and society. We propose tests to evaluate textual consistencies as well as numerical consistencies across similar proverbs, and demonstrate the effectiveness of our method and dataset in identifying…

Taxonomy

Topics: Diverse Approaches in Healthcare and Education Studies · Educational Technology and Assessment · Technology and Data Analysis

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding