AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge
Karen Zhou, Chenhao Tan

TL;DR
AutoChecklist introduces a modular, open-source framework for generating and scoring checklists with LLMs, enhancing interpretability and alignment in evaluation, with validated effectiveness and domain adaptability.
Contribution
The paper presents AutoChecklist, a novel library that unifies checklist-based evaluation into composable pipelines with a flexible taxonomy and supports multiple LLM providers.
Findings
Checklist methods align well with human preferences.
AutoChecklist's pipelines improve evaluation consistency.
The framework supports domain-specific adaptation.
Abstract
Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge. Beyond evaluation, these structured criteria can serve as signals for model alignment, reinforcement learning, and self-correction. To support these use cases, we present AutoChecklist, an open-source library that unifies checklist-based evaluation into composable pipelines. At its core is a taxonomy of five checklist generation abstractions, each encoding a distinct strategy for deriving evaluation criteria. A modular Generator Refiner Scorer pipeline connects any generator with a unified scorer, and new configurations can be registered via prompt templates alone. The library ships with ten built-in pipelines implementing published approaches and supports multiple LLM providers (OpenAI, OpenRouter, vLLM). Beyond the Python API, the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
