The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests

Lior Madmoni; Amir Zait; Ilia Labzovsky; Danny Karmon

arXiv:2409.14371 · cs.CL · September 24, 2024

TL;DR

This paper investigates the ability of large language models to evaluate whether agent responses satisfy complex constraints in open-ended requests, introducing a new dataset and benchmarking their reasoning and arithmetic skills.

Contribution

The paper presents the ACS dataset for evaluating constraint satisfaction and benchmarks LLMs, revealing their limitations and the challenges of few-shot prompting in this context.

Findings

01 Most LLMs show significant room for improvement in constraint evaluation.

02 Errors mainly arise from reasoning difficulties in the models.

03 Few-shot prompting often degrades model performance.

Abstract

Generative AI agents are often expected to respond to complex user requests that have No One Right Answer (NORA), e.g., "design a vegetarian meal plan below 1800 calories". Such requests may entail a set of constraints that the agent should adhere to. To successfully develop agents for NORA scenarios, an accurate automatic evaluation framework is essential, specifically one capable of validating the satisfaction of constraints in the agent's response. Recently, large language models (LLMs) have been adopted as versatile evaluators for many NORA tasks, but their ability to evaluate constraint-satisfaction in generated text remains unclear. To study this, we develop and release a novel Arithmetic Constraint-Satisfaction (ACS) benchmarking dataset. The dataset consists of complex user requests with corresponding constraints, agent responses and human labels indicating each…
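To make the abstract's example request concrete, the following is a minimal sketch of what checking "a vegetarian meal plan below 1800 calories" involves once a response has been parsed into structured data. It is not the paper's method or the ACS dataset format: the `Meal` class, the `satisfies_constraints` function, and all meal names and calorie values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Meal:
    # Hypothetical structured form of one item in an agent's response.
    name: str
    calories: int
    vegetarian: bool


def satisfies_constraints(meals, max_calories=1800):
    """Return True if every meal is vegetarian and the total is below the limit."""
    total = sum(m.calories for m in meals)
    return all(m.vegetarian for m in meals) and total < max_calories


# Illustrative values only; not taken from the paper or its dataset.
plan = [
    Meal("oatmeal with berries", 350, True),
    Meal("lentil soup", 450, True),
    Meal("vegetable stir-fry with tofu", 600, True),
]

print(satisfies_constraints(plan))  # total is 1400, so this prints True
```

An LLM evaluator has to reach the same verdict from free text, which means extracting the quantities, summing them, and comparing against the limit without the structured input assumed here.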

Taxonomy

Topics: Multi-Agent Systems and Negotiation

Methods: Sparse Evolutionary Training