Context Is Not Comprehension
Alex Pan, Mary-Anne Williams

TL;DR
The paper introduces Verbose ListOps (VLO), a benchmark that evaluates multi-step reasoning in language models by embedding deterministic computations in narratives, revealing models' true comprehension beyond mere recall.
Contribution
VLO provides a step-level evaluation framework for reasoning in language models. It shifts assessment from sheer context length toward comprehension, and its generation pipeline can embed any deterministically verifiable reasoning schema in narrative form.
Findings
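Models that solve raw ListOps with approximately 100% accuracy collapse on VLO after only 10,000 tokens
VLO exposes where a model's reasoning chain first diverges
VLO's generation pipeline is task-agnostic and can embed any deterministically verifiable reasoning schema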
Abstract
The dominant way of judging Large Language Models (LLMs) has been to ask how well they can recall explicit facts from very long inputs. While today's best models achieve near-perfect recall, this masks a harder skill: performing multi-step reasoning and tracking intermediate state that never appears verbatim. We introduce Verbose ListOps (VLO), a benchmark that embeds deterministic ListOps computations inside narrative camouflage and, crucially, allows step-level evaluation of every intermediate result. Experiments show that models which solve raw ListOps with approximately 100% accuracy collapse on VLO after only 10,000 tokens. By exposing where a model's reasoning chain first diverges, VLO moves assessment beyond sheer context length and toward genuine comprehension. VLO's generation pipeline is task-agnostic: it can weave any deterministically verifiable reasoning schema --…
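To make the idea of step-level evaluation concrete, here is a minimal Python sketch of how a raw ListOps expression can be evaluated while recording every intermediate result, and how the first point of divergence in a model's reasoning chain might be located. It is an illustration under stated assumptions, not the paper's pipeline: the operator set (MAX, MIN, MED, SM) is borrowed from the original ListOps task, the use of the low median for MED is a simplification, and the names `evaluate` and `first_divergence` are hypothetical.

```python
from statistics import median_low

# Operator set assumed from the original ListOps task; VLO's exact set may differ.
OPS = {
    "MAX": max,
    "MIN": min,
    "MED": median_low,               # assumption: low median for even-length lists
    "SM": lambda xs: sum(xs) % 10,   # sum modulo 10
}


def tokenize(expr):
    """Split a bracketed expression such as '[MAX 2 9 [MIN 4 7 ] 0 ]' into tokens."""
    return expr.replace("[", " [ ").replace("]", " ] ").split()


def evaluate(expr):
    """Return (final_value, steps).

    steps holds one (operator, arguments, value) tuple per sub-expression,
    in the order the sub-expressions finish evaluating (innermost first).
    """
    tokens = tokenize(expr)
    steps = []
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "[":
            return int(tok)          # a plain digit
        op = tokens[pos]
        pos += 1
        args = []
        while tokens[pos] != "]":
            args.append(parse())
        pos += 1                     # consume the closing bracket
        value = OPS[op](args)
        steps.append((op, tuple(args), value))
        return value

    return parse(), steps


def first_divergence(reference_steps, predicted_values):
    """Index of the first step whose predicted value is wrong or missing, else None."""
    for i, (_, _, expected) in enumerate(reference_steps):
        if i >= len(predicted_values) or predicted_values[i] != expected:
            return i
    return None


final, steps = evaluate("[SM 3 [MAX 1 8 ] [MED 2 5 9 ] ]")
print(final, steps)
print(first_divergence(steps, [8, 4, 6]))
```

In this sketch a model's answer is compared step by step against the reference trace, so a wrong final answer can be attributed to the earliest incorrect intermediate value rather than scored only as a single failure.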
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
