TL;DR
HAMLET is an automated framework that evaluates large language models' comprehension of long texts by structuring content hierarchically and using query-based summaries, revealing strengths and weaknesses across different model types and scales.
Contribution
We propose HAMLET, a novel automated evaluation framework for multi-level comprehension of LLMs in book-length contexts, validated by high agreement with human judgments.
Findings
LLMs struggle with fine-grained, leaf-level comprehension.
Performance varies significantly between open-source and proprietary models.
Model performance is affected by positional effects within long texts.
Abstract
We introduce HAMLET, a holistic and automated framework for evaluating the long-context comprehension of large language models (LLMs). HAMLET structures source texts into a three-level key-fact hierarchy at root-, branch-, and leaf-levels, and employs query-focused summarization to evaluate how well models recall and faithfully represent information at each level. To validate the reliability of our fully automated pipeline, we conduct a systematic human study, showing that our automatic evaluation achieves over 90% agreement with expert human judgments, while reducing the cost by up to 25 times. HAMLET reveals that LLMs struggle with fine-grained comprehension, especially at the leaf level, and are sensitive to positional effects like the lost-in-the-middle. Analytical queries pose greater challenges than narrative ones, and consistent performance gaps emerge between open-source and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
