Large Language Models Are Active Critics in NLG Evaluation
Shuying Xu, Junjie Hu, Ming Jiang

TL;DR
This paper introduces Active-Critic, an innovative LLM-based evaluation method that adapts to diverse NLG tasks by inferring criteria and optimizing prompts, resulting in evaluations more aligned with human judgments.
Contribution
The paper presents a novel active critic framework that enables LLMs to adaptively evaluate NLG outputs using limited examples, improving flexibility and alignment with human standards.
Findings
Active-Critic achieves better alignment with human judgments.
It can infer evaluation criteria from limited data.
The method produces nuanced, context-aware scores.
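"Alignment with human judgments" in NLG meta-evaluation is commonly measured as a rank correlation between evaluator scores and human ratings. The summary here does not say which statistic the paper reports, so the sketch below assumes Spearman's rho purely as an illustration; the function names `average_ranks` and `spearman` are hypothetical.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```

Calling `spearman(evaluator_scores, human_scores)` on paired score lists gives a value in [-1, 1]; values closer to 1 mean the evaluator ranks outputs more like the human annotators do.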
Abstract
The conventional paradigm of using large language models (LLMs) for natural language generation (NLG) evaluation relies on pre-defined task definitions and evaluation criteria, positioning LLMs as "passive critics" that strictly follow developer-provided guidelines. However, human evaluators often apply implicit criteria, and their expectations in practice can vary widely based on specific end-user needs. Consequently, these rigid evaluation methods struggle to adapt to diverse scenarios without extensive prompt customization. To address this, we introduce Active-Critic, a novel LLM-based evaluator that transforms LLMs into "active critics" capable of adapting to diverse NLG tasks using limited example data. Active-Critic consists of two stages: (1) self-inferring the target NLG task and relevant evaluation criteria, and (2) dynamically optimizing prompts to produce human-aligned…
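The two stages named in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `LLM` callable, the `Example` record, the prompt wording, and the choice of "pick the few-shot demonstration set with the lowest mean absolute error against human scores" as the optimization step are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


@dataclass
class Example:
    source: str         # input given to the NLG system
    output: str         # text the NLG system produced
    human_score: float  # human rating for that output


def infer_task_and_criteria(llm: LLM, examples: list[Example]) -> str:
    """Stage 1 (assumed form): ask the LLM what task and criteria the scored examples imply."""
    shown = "\n\n".join(
        f"Input: {e.source}\nOutput: {e.output}\nHuman score: {e.human_score}"
        for e in examples
    )
    return llm(
        "Describe the NLG task and the evaluation criteria implied by "
        "these human-scored examples:\n\n" + shown
    )


def build_prompt(instruction: str, demos: list[Example], item: Example) -> str:
    """Combine the inferred instruction, few-shot demonstrations, and the item to score."""
    demo_text = "\n\n".join(
        f"Input: {d.source}\nOutput: {d.output}\nHuman score: {d.human_score}"
        for d in demos
    )
    return (
        f"{instruction}\n\n{demo_text}\n\n"
        f"Input: {item.source}\nOutput: {item.output}\nScore (number only):"
    )


def select_demos(
    llm: LLM,
    instruction: str,
    candidates: list[list[Example]],
    dev: list[Example],
) -> list[Example]:
    """Stage 2 (assumed form): keep the demo set whose scores best match human scores on dev."""

    def mean_abs_error(demos: list[Example]) -> float:
        errors = [
            abs(float(llm(build_prompt(instruction, demos, e)).strip()) - e.human_score)
            for e in dev
        ]
        return sum(errors) / len(errors)

    return min(candidates, key=mean_abs_error)
```

In use, `llm` would wrap a real model call, `candidates` would hold alternative demonstration sets drawn from the limited labeled data, and the returned set would be fixed into the prompt used for scoring unseen outputs.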
Peer Reviews
Decision · Submitted to ICLR 2025
- As far as I’m aware, no prior work applies prompt optimization to NLG evaluation, so automatically designing NLG evaluation prompts seems quite novel and is clearly a practical and useful application. The approach also decomposes the evaluation into multiple aspects (e.g., task, criteria, few-shot examples, explainability).
- They provide ablation numbers to make clearer where the performance improvements originate from, showing that all of the
- The experimental results have not comprehensively demonstrated the advantages of the active-critic approach. The examined models are somewhat limited (ORCA-13B and GPT3.5), and the baselines compared also appear limited and inconsistent (e.g., G-EVAL is not used in Table 1, and Table 2 compares only to G-EVAL). Whether the approach remains as effective with larger, more capable models, which may be better aligned, is not clear.
- Is the contribution of the work the simplicit
This work highlights a limitation of current LLM-based NLG evaluation methods: they often rely on specific evaluation instructions to perform passive evaluations. While human-crafted evaluation criteria may intuitively enhance the controllability, they lead to high costs. To address this, this work proposes a method for enabling models to conduct active evaluations and conducts experiments on multiple NLG evaluation tasks.
Although the concept of active evaluation is meaningful, approaches with similar motivation have been explored in previous works (Liu et al., 2024; Li et al., 2024; Liu et al., 2024). Furthermore, prior research has pointed out that generating explanations along with scores can enhance evaluation performance (Chiang et al., 2023; Chiang et al., 2023). The proposed method requires some training data to fine-tune prompts, which significantly reduces its generalizability (how would it handle evalu
1. The paper presents a straightforward approach that is easy to understand and follow.
2. Addressing NLG evaluation is essential, especially as LLMs continue to evolve, making evaluation standards increasingly important.
3. The two-stage design in ACTIVE-CRITIC enhances explainability and flexibility, allowing for better alignment with human judgments without extensive manual prompt engineering.
1. The proposed approach is largely a pipeline with prompt engineering. While it enables LLMs to determine evaluation aspects independently, this method appears to be a direct and relatively simple extension of prompt engineering rather than a novel framework.
2. NLG evaluation aims to approximate human judgment, with scoring criteria derived from human-annotated datasets. The paper’s method, which uses LLMs to learn and generate these criteria, may seem contradictory, as it implies that while hum
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling
