LLMbench: A Comparative Close Reading Workbench for Large Language Models
David M. Berry

TL;DR
LLMbench is a browser-based tool designed for detailed, hermeneutic analysis of large language model outputs, emphasizing interpretability and visualisation of probabilistic text features.
Contribution
It introduces a novel, humanities-oriented interface for comparing LLM responses with analytical overlays and visualisations, focusing on interpretability over quantitative metrics.
Findings
Enables side-by-side comparison of LLM outputs with detailed annotations.
Provides visualisations such as heatmaps and probability terrains to interpret model behaviour.
Supports critical, hermeneutic analysis of generative AI outputs.
Abstract
LLMbench is a browser-based workbench for the comparative close reading of large language model (LLM) outputs. Where existing tools for LLM comparison, such as Google PAIR's LLM Comparator, are engineered for quantitative evaluation and user-rating metrics, LLMbench is oriented towards the hermeneutic practices of the digital humanities. Two model responses to the same prompt are displayed side by side in annotatable panels with four analytical overlays (Probabilities for token-level log-probability inspection, Differences for word-level diff across the two panels, Tone for Hyland-style metadiscourse analysis, and Structure for sentence-level parsing with discourse connective highlighting), alongside five analytical modes (Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence) that make the probabilistic structure of generated text legible…
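To make two of the overlays described in the abstract more concrete, the following is a minimal Python sketch of a word-level diff between two responses and of converting token log-probabilities into probabilities. It is an illustration only and is not taken from LLMbench: the function names `word_diff` and `token_probabilities` are hypothetical, whitespace splitting stands in for whatever tokenisation the tool actually uses, and the `(token, logprob)` input shape is an assumption about how a model API might return its scores.

```python
import difflib
import math


def word_diff(left: str, right: str):
    """Word-level diff between two responses (hypothetical helper).

    Returns a list of (tag, left_words, right_words) spans, where tag is
    one of 'equal', 'replace', 'delete', or 'insert'.
    """
    a, b = left.split(), right.split()  # naive whitespace split (assumption)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def token_probabilities(logprobs):
    """Convert (token, natural-log probability) pairs to (token, probability)."""
    return [(token, math.exp(lp)) for token, lp in logprobs]


if __name__ == "__main__":
    for span in word_diff("the cat sat", "the dog sat"):
        print(span)
    print(token_probabilities([("Hello", 0.0), ("world", math.log(0.5))]))
```

In an interface of this kind, the `replace`, `delete`, and `insert` spans would be the candidates for highlighting across the two panels, and the per-token probabilities could drive a colour scale for a heatmap.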
