KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning
Peiqi Sui, Juan Diego Rodriguez, Philippe Laban, Dean Murphy, Joseph P. Dexter, Richard Jean So, Samuel Baker, Pramit Chaudhuri

TL;DR
KRISTEVA introduces a novel benchmark for assessing large language models' interpretive reasoning in close reading, a key literary analysis skill, revealing current models' limitations compared to human experts.
Contribution
This paper presents the first large-scale close reading benchmark, KRISTEVA, with tasks designed to evaluate interpretive reasoning in literary analysis for LLMs.
Findings
LLMs achieve 49.7%-69.7% accuracy on close reading tasks
Models underperform compared to human evaluators on most tasks
KRISTEVA enables systematic evaluation of interpretive reasoning in LLMs
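As a rough illustration of how an accuracy figure like those above is computed for a multiple-choice benchmark, the sketch below scores predicted answer letters against a gold key. The function name `multiple_choice_accuracy` and the toy data are hypothetical and are not taken from the paper; KRISTEVA's actual evaluation code may differ.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the fraction of questions whose predicted letter matches the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty question set")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


# Hypothetical toy data, not drawn from the benchmark.
gold_answers = ["A", "C", "B", "D"]
model_answers = ["A", "c", "D", "D"]

accuracy = multiple_choice_accuracy(model_answers, gold_answers)
print(f"{accuracy:.1%}")
```

Exact-match scoring of answer letters is the simplest choice here; a real harness would also need to parse the letter out of free-form model output, which this sketch does not attempt.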
Abstract
Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, in which they gather textual details to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been evaluated on large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how…
Taxonomy
Topics: Education and Critical Thinking Development
