MalAlgoQA: Pedagogical Evaluation of Counterfactual Reasoning in Large Language Models and Implications for AI in Education

Naiming Liu; Shashank Sonkar; Myco Le; Richard Baraniuk

arXiv:2407.00938·cs.CL·October 8, 2024·2 cites

TL;DR

MalAlgoQA introduces a dataset to evaluate large language models' ability to understand flawed reasoning in educational contexts, revealing challenges in counterfactual reasoning and the impact of prompting techniques.

Contribution

The paper presents MalAlgoQA, a new dataset and evaluation framework for assessing counterfactual reasoning in LLMs, highlighting limitations and effects of prompting methods.

Findings

01. State-of-the-art LLMs perform worse on malgorithm identification than on correct rationale identification.

02. Chain-of-thought prompting does not consistently improve counterfactual reasoning performance.

03. Results have implications for AI tutoring systems and addressing student misconceptions.

Abstract

This paper introduces MalAlgoQA, a novel dataset designed to evaluate the counterfactual reasoning capabilities of Large Language Models (LLMs) through a pedagogical approach. The dataset comprises mathematics and reading comprehension questions, each accompanied by four answer choices and their corresponding rationales. At the heart of MalAlgoQA are "malgorithms": rationales behind incorrect answer choices that represent flawed yet logically coherent reasoning paths. These malgorithms serve as counterfactual scenarios, allowing us to assess an LLM's ability to identify and analyze flawed reasoning patterns. We propose the Malgorithm Identification task, in which LLMs are assessed on their ability to identify the corresponding malgorithm given an incorrect answer choice. To evaluate model performance, we introduce two metrics: Algorithm Identification Accuracy (AIA) for correct…
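The identification task described in the abstract can be illustrated with a minimal sketch. It assumes each evaluation item records whether the answer choice shown to the model was the correct one, which of the four rationales belongs to it, and which rationale the model picked. The `Prediction` record, the function names, and the use of plain accuracy are all assumptions made for illustration; the abstract is truncated before the metrics are fully defined, so this is not the paper's scoring code.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record of one model judgement (not the dataset's schema)."""
    choice_is_correct: bool   # True if the answer choice shown was the right answer
    gold_rationale: int       # index (0-3) of the rationale belonging to that choice
    predicted_rationale: int  # index (0-3) of the rationale the model selected


def accuracy(preds: Iterable[Prediction]) -> float:
    """Fraction of items where the model picked the matching rationale."""
    preds = list(preds)
    if not preds:
        return 0.0  # assumed convention for an empty split
    hits = sum(p.predicted_rationale == p.gold_rationale for p in preds)
    return hits / len(preds)


def split_accuracy(preds: Iterable[Prediction]) -> Tuple[float, float]:
    """Return (accuracy on correct choices, accuracy on incorrect choices)."""
    preds = list(preds)
    on_correct = [p for p in preds if p.choice_is_correct]
    on_incorrect = [p for p in preds if not p.choice_is_correct]
    return accuracy(on_correct), accuracy(on_incorrect)


# Toy data, invented purely to exercise the functions.
example = [
    Prediction(choice_is_correct=True, gold_rationale=0, predicted_rationale=0),
    Prediction(choice_is_correct=True, gold_rationale=0, predicted_rationale=0),
    Prediction(choice_is_correct=False, gold_rationale=1, predicted_rationale=1),
    Prediction(choice_is_correct=False, gold_rationale=2, predicted_rationale=3),
]

correct_acc, malgorithm_acc = split_accuracy(example)
print(f"correct-choice accuracy: {correct_acc:.2f}")
print(f"incorrect-choice (malgorithm) accuracy: {malgorithm_acc:.2f}")
```

Reporting the two splits separately mirrors the abstract's distinction between rationales for correct answers and malgorithms for incorrect ones, and makes a gap between the two directly visible.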

Code & Models

Repositories

luffycodes/MalAlgoQA-Dataset (official)

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Focus