A Study on Large Language Models' Limitations in Multiple-Choice Question Answering

Aisha Khatun; Daniel G. Brown

arXiv:2401.07955 · cs.CL · August 16, 2024 · 1 cites

TL;DR

This paper systematically analyzes how well 26 small open-source Large Language Models answer multiple-choice questions, finding that most of them misunderstand the task and that MCQ-based evaluation of such models requires care.

Contribution

It provides the first comprehensive assessment of small open-source LLMs' performance on MCQ tasks, highlighting their deficiencies and the importance of task understanding.

Findings

01. 65% of models do not understand MCQ tasks
02. Only 4 models correctly select answers from choices
03. Just 5 models are choice-order independent

Abstract

The widespread adoption of Large Language Models (LLMs) has become commonplace, particularly with the emergence of open-source models. More importantly, smaller models are well-suited for integration into consumer devices and are frequently employed either as standalone solutions or as subroutines in various AI tasks. Despite their ubiquitous use, there is no systematic analysis of their specific capabilities and limitations. In this study, we tackle one of the most widely used tasks - answering Multiple Choice Questions (MCQs). We analyze 26 small open-source models and find that 65% of the models do not understand the task, only 4 models properly select an answer from the given choices, and only 5 of these models are choice order independent. These results are rather alarming given the extensive use of MCQ tests with these models. We recommend exercising caution and testing task…
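The two properties named in the abstract, selecting an answer from the given choices and being choice order independent, can be illustrated with a small check. The following is a minimal sketch and not the authors' evaluation code: the prompt layout, the strict reply parsing, and the names `format_prompt`, `picked_choice`, and `is_order_independent` are all assumptions made for illustration, and `model` stands for any hypothetical callable that maps a prompt string to a reply string.

```python
from itertools import permutations
from typing import Callable, Optional, Sequence

LABELS = "ABCDEFGH"  # assumed labelling scheme, one letter per choice


def format_prompt(question: str, choices: Sequence[str]) -> str:
    """Render a question and its choices as a lettered MCQ prompt (assumed layout)."""
    lines = [question]
    lines += [f"{LABELS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def picked_choice(reply: str, choices: Sequence[str]) -> Optional[str]:
    """Return the choice text a reply points to, or None if the reply
    is not a single valid label such as "B", "b." or "(B)"."""
    label = reply.strip().strip("().:").upper()
    valid = LABELS[: len(choices)]
    if len(label) == 1 and label in valid:
        return choices[valid.index(label)]
    return None


def is_order_independent(
    model: Callable[[str], str], question: str, choices: Sequence[str]
) -> bool:
    """True if the model picks the same choice text under every ordering
    of the choices and every reply is a valid label."""
    answers = set()
    for ordering in permutations(choices):
        reply = model(format_prompt(question, ordering))
        answers.add(picked_choice(reply, ordering))
    return len(answers) == 1 and None not in answers


# Two toy stand-ins for a model (hypothetical, for demonstration only).
def always_a(prompt: str) -> str:
    return "A"  # position-biased: ignores the content of the choices


def knows_paris(prompt: str) -> str:
    for line in prompt.splitlines():
        if line.endswith(". Paris"):
            return line[0]  # the label in front of the matching choice
    return "?"
```

A reply that `picked_choice` maps to `None` corresponds to a model that does not select from the given choices, while `always_a` shows a model that does select a label but only by position. Trying every permutation grows factorially with the number of choices, so a real evaluation would more likely sample a few orderings.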

Code & Models

Repositories

tanny411/llm-reliability-and-consistency-evaluation

Taxonomy

Topics: Topic Modeling · Expert finding and Q&A systems · Recommender Systems and Techniques