A Statistical Analysis of LLMs' Self-Evaluation Using Proverbs
Ryosuke Sonoda, Ramya Srinivasan

TL;DR
This paper evaluates LLMs' self-assessment abilities using a new proverb reasoning dataset, revealing limitations in reasoning and cultural understanding, as well as biases such as gender stereotypes.
Contribution
Introduces a novel proverb dataset and evaluation method to analyze LLMs' self-evaluation, highlighting their reasoning failures and cultural biases.
Findings
LLMs show inconsistencies in proverb reasoning tasks.
The dataset uncovers gender stereotypes in LLM responses.
The proposed method effectively identifies issues in LLMs' reasoning and cultural understanding.
Abstract
Large language models (LLMs) such as ChatGPT, GPT-4, Claude-3, and Llama are being integrated across a variety of industries. Despite this rapid proliferation, experts are calling for caution in the interpretation and adoption of LLMs, owing to numerous associated ethical concerns. Research has also uncovered shortcomings in LLMs' reasoning and logical abilities, raising questions about the potential of LLMs as evaluation tools. In this paper, we investigate LLMs' self-evaluation capabilities on a novel proverb reasoning task. We introduce a novel proverb database consisting of 300 proverb pairs that are similar in intent but different in wording, across topics spanning gender, wisdom, and society. We propose tests to evaluate textual consistencies as well as numerical consistencies across similar proverbs, and demonstrate the effectiveness of our method and dataset in identifying…
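The abstract does not spell out how the numerical consistency tests are defined, so the following is only an illustrative sketch of the general idea: if two proverbs share the same intent, the scores an LLM assigns to them should be close. The function name `numerical_consistency`, the `tolerance` parameter, and the example scores are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def numerical_consistency(score_pairs, tolerance=1.0):
    """Summarize how consistently paired proverbs were scored.

    score_pairs: list of (score_a, score_b) tuples, one per proverb pair,
                 where each score is a numeric rating produced by an LLM.
    tolerance:   hypothetical threshold; a pair counts as inconsistent when
                 the absolute score gap exceeds it.
    """
    if not score_pairs:
        raise ValueError("score_pairs must contain at least one pair")

    # Absolute gap between the two scores in each pair.
    gaps = [abs(a - b) for a, b in score_pairs]

    # Indices of pairs whose gap is strictly larger than the tolerance.
    inconsistent = [i for i, gap in enumerate(gaps) if gap > tolerance]

    return {
        "mean_gap": mean(gaps),
        "inconsistent_pairs": inconsistent,
        "consistency_rate": 1 - len(inconsistent) / len(gaps),
    }


# Made-up scores for three proverb pairs (not data from the paper).
example = numerical_consistency([(4, 4), (5, 2), (3, 4)])
print(example["inconsistent_pairs"])
```

In this sketch only the second pair is flagged, because its gap of 3 exceeds the assumed tolerance of 1. A real study would likely replace the fixed threshold with a statistical test over all pairs.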
Taxonomy
Topics: Diverse Approaches in Healthcare and Education Studies · Educational Technology and Assessment · Technology and Data Analysis
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Multi-Head Attention · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding
