Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators
Pavel Šindelář, Ondřej Bojar

TL;DR
This paper reports on the 2025 Sensemaking shared task, evaluating how well large language models can generate, answer, and evaluate questions based on diverse educational texts across multiple languages.
Contribution
It introduces a new shared task framework for assessing LLMs in a classroom-inspired sensemaking process and presents baseline results and evaluation methods for this challenge.
Findings
Question generation remains challenging to evaluate.
LLMs perform acceptably in question answering but struggle to restrict their answers to the given input materials.
Adversarial tests show LLMs can incorrectly rate flawed answers as acceptable.
Abstract
ELOQUENT is a set of shared tasks that aims to create easily testable high-level criteria for evaluating generative language models. Sensemaking is one such shared task. In Sensemaking, we try to assess how well generative models "make sense out of a given text" in three steps inspired by exams in a classroom setting: (1) Teacher systems should prepare a set of questions, (2) Student systems should answer these questions, and (3) Evaluator systems should score these answers, all adhering rather strictly to a given set of input materials. We report on the 2025 edition of Sensemaking, where we had 7 sources of test materials (fact-checking analyses of statements, textbooks, transcribed recordings of a lecture, and educational videos) spanning English, German, Ukrainian, and Czech languages. This year, 4 teams participated, providing us with 2 Teacher submissions, 2 Student…
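The three-step Teacher, Student, Evaluator flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the shared task's actual interface: the names `run_sensemaking`, `QAItem`, and the three `toy_*` functions are hypothetical, and the toy roles are trivial string-based stand-ins for real LLM systems.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAItem:
    question: str
    answer: str
    score: float


# Hypothetical role signatures (assumptions, not the task's real I/O format).
Teacher = Callable[[str], List[str]]          # source text -> questions
Student = Callable[[str, str], str]           # (source text, question) -> answer
Evaluator = Callable[[str, str, str], float]  # (text, question, answer) -> score


def run_sensemaking(text: str, teacher: Teacher, student: Student,
                    evaluator: Evaluator) -> List[QAItem]:
    """Chain the three roles over one piece of input material."""
    items = []
    for question in teacher(text):
        answer = student(text, question)
        score = evaluator(text, question, answer)
        items.append(QAItem(question, answer, score))
    return items


def split_sentences(text: str) -> List[str]:
    # Naive splitter on full stops; good enough for the toy example.
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_teacher(text: str) -> List[str]:
    # Toy stand-in: one question per sentence of the source text.
    count = len(split_sentences(text))
    return [f"What is stated in sentence {i}?" for i in range(1, count + 1)]


def toy_student(text: str, question: str) -> str:
    # Toy stand-in: read the sentence number back out of the question.
    index = int(question.rstrip("?").split()[-1])
    return split_sentences(text)[index - 1]


def toy_evaluator(text: str, question: str, answer: str) -> float:
    # Toy stand-in: an answer only scores if it appears verbatim in the
    # input material, mimicking the "adhere to the given text" requirement.
    return 1.0 if answer and answer in text else 0.0


if __name__ == "__main__":
    material = "Water boils at 100 C. Ice melts at 0 C."
    for item in run_sensemaking(material, toy_teacher, toy_student, toy_evaluator):
        print(item)
```

In a real setup each role would be backed by a separate system, and the Evaluator would need a far more robust grounding check than substring matching.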
Taxonomy
Topics: Biomedical and Engineering Education
