Prompting a Weighting Mechanism into LLM-as-a-Judge in Two-Step: A Case Study
Wenwen Xie, Gray Gwizdz, Dongji Feng

TL;DR
This paper introduces a prompt design mechanism that enables Large Language Models to better evaluate natural language generation by effectively weighting topic importance, leading to more accurate assessments.
Contribution
It proposes a novel prompt engineering approach that incorporates explicit importance weighting, improving LLMs' judgment accuracy in NLG evaluation tasks.
Findings
Achieved an average 6% improvement in Human Alignment Rate (HAR).
Enhanced LLMs' ability to prioritize relevant information.
Demonstrated effectiveness through a case study.
Abstract
While Large Language Models (LLMs) have emerged as promising tools for evaluating Natural Language Generation (NLG) tasks, their effectiveness is limited by their inability to appropriately weigh the importance of different topics: they often overemphasize minor details while undervaluing critical information, which leads to misleading assessments. Our work proposes an efficient prompt design mechanism to address this specific limitation and provides a case study. Through strategic prompt engineering that incorporates explicit importance weighting mechanisms, we enhance the ability of an LLM-as-a-Judge to prioritize relevant information effectively, as demonstrated by an average improvement of 6% in the Human Alignment Rate (HAR) metric.
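The abstract does not spell out the prompts or the exact definition of HAR, so the following is only an illustrative sketch, not the authors' implementation. It assumes a two-step flow in which a judge first assigns an importance weight to each topic and then scores each topic, after which the scores are combined as a weighted average. It also assumes that HAR is the fraction of items on which the judge's verdict matches the human verdict. The function names `weighted_score` and `human_alignment_rate`, and the example topics, are hypothetical.

```python
from typing import Dict, List


def weighted_score(topic_scores: Dict[str, float],
                   topic_weights: Dict[str, float]) -> float:
    """Combine per-topic scores into one score via importance weights.

    Assumption: weights come from a first judging step and scores from a
    second one; here both are plain numbers supplied by the caller.
    """
    total_weight = sum(topic_weights[t] for t in topic_scores)
    if total_weight <= 0:
        raise ValueError("topic weights must sum to a positive value")
    weighted_sum = sum(topic_scores[t] * topic_weights[t] for t in topic_scores)
    return weighted_sum / total_weight


def human_alignment_rate(judge_verdicts: List[bool],
                         human_verdicts: List[bool]) -> float:
    """Assumed HAR: share of items where judge and human verdicts agree."""
    if not judge_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be non-empty and equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


# Hypothetical example: the answer gets the critical topic right but
# misses a minor one.
scores = {"critical_fact": 1.0, "minor_detail": 0.0}
weights = {"critical_fact": 0.9, "minor_detail": 0.1}

unweighted = sum(scores.values()) / len(scores)
weighted = weighted_score(scores, weights)

har = human_alignment_rate([True, False, True, True],
                           [True, True, True, False])
```

In this toy case the unweighted mean treats both topics equally, while the weighted score reflects that the critical topic was handled correctly, which is the kind of prioritization the abstract describes.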
Taxonomy
Topics: Legal Education and Practice Innovations · Artificial Intelligence in Law · Legal Systems and Judicial Processes
