Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts

Jiaqi Deng; Yuho Lee; Nicole Hee-Yeon Kim; Hyangsuk Min; Taewon Yun; Minjeong Ban; Kim Yul; Hwanjun Song

arXiv:2508.19578 · cs.CL · August 28, 2025

TL;DR

HAMLET is an automated framework that evaluates large language models' comprehension of long texts by structuring content hierarchically and using query-based summaries, revealing strengths and weaknesses across different model types and scales.

Contribution

We propose HAMLET, a novel automated evaluation framework for multi-level comprehension of LLMs in book-length contexts, validated by high agreement with human judgments.

Findings

01. LLMs struggle with fine-grained, leaf-level comprehension.

02. Performance varies significantly between open-source and proprietary models.

03. Model performance is affected by positional effects within long texts.

Abstract

We introduce HAMLET, a holistic and automated framework for evaluating the long-context comprehension of large language models (LLMs). HAMLET structures source texts into a three-level key-fact hierarchy (root, branch, and leaf levels) and employs query-focused summarization to evaluate how well models recall and faithfully represent information at each level. To validate the reliability of our fully automated pipeline, we conduct a systematic human study, showing that our automatic evaluation achieves over 90% agreement with expert human judgments while reducing the cost by up to 25 times. HAMLET reveals that LLMs struggle with fine-grained comprehension, especially at the leaf level, and are sensitive to positional effects such as the lost-in-the-middle effect. Analytical queries pose greater challenges than narrative ones, and consistent performance gaps emerge between open-source and…
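As a rough illustration of the per-level evaluation described above, the following sketch stores key facts in a three-level tree and computes recall separately for the root, branch, and leaf levels. It is not HAMLET's implementation. The names `KeyFact`, `flatten`, and `recall_by_level` are hypothetical, and the set `covered` stands in for whatever judgment step decides that a model's summary contains a given key fact. Defining recall as covered facts divided by total facts at each level is likewise an assumption made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Set

LEVELS = ("root", "branch", "leaf")


@dataclass
class KeyFact:
    """One node of a hypothetical key-fact hierarchy."""
    text: str
    level: str  # expected to be one of LEVELS
    children: List["KeyFact"] = field(default_factory=list)


def flatten(node: KeyFact) -> Iterator[KeyFact]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from flatten(child)


def recall_by_level(root: KeyFact, covered: Set[str]) -> Dict[str, float]:
    """Fraction of key facts at each level whose text appears in `covered`.

    `covered` is assumed to come from a separate judgment step (human or
    automatic) that marks which key facts a summary contains.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for fact in flatten(root):
        totals[fact.level] += 1
        hits[fact.level] += int(fact.text in covered)
    # Skip levels with no facts to avoid dividing by zero.
    return {lvl: hits[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}


# Toy hierarchy: one root fact, two branch facts, three leaf facts.
tree = KeyFact("theme", "root", [
    KeyFact("event A", "branch", [
        KeyFact("detail A1", "leaf"),
        KeyFact("detail A2", "leaf"),
    ]),
    KeyFact("event B", "branch", [KeyFact("detail B1", "leaf")]),
])

# Facts that a hypothetical summary was judged to contain.
covered = {"theme", "event A", "detail A1"}
scores = recall_by_level(tree, covered)
print(scores)
```

In this toy case the summary covers the single root fact, one of two branch facts, and one of three leaf facts, so recall falls from root to leaf. The numbers are invented for illustration and are not results from the paper.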
