ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation
Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, Tat-Seng Chua

TL;DR
ALI-Agent introduces an adaptive, agent-based evaluation framework that automates and refines testing of LLMs' alignment with human values, effectively identifying misalignments and long-tail risks.
Contribution
This work presents ALI-Agent, a novel framework leveraging LLM-powered agents for dynamic, scalable, and in-depth assessment of LLMs' alignment with human values, overcoming static benchmark limitations.
Findings
Effectively identifies model misalignment in stereotypes, morality, and legality.
Generates meaningful, diverse test scenarios for real-world use cases.
Probes long-tail risks with enhanced scenario refinement.
Abstract
Large Language Models (LLMs) can elicit unintended and even harmful content when misaligned with human values, posing severe risks to users and society. To mitigate these risks, current evaluation benchmarks predominantly employ expert-designed contextual scenarios to assess how well LLMs align with human values. However, the labor-intensive nature of these benchmarks limits their test scope, hindering their ability to generalize to the extensive variety of open-world use cases and identify rare but crucial long-tail risks. Additionally, these static tests fail to adapt to the rapid evolution of LLMs, making it hard to evaluate timely alignment issues. To address these challenges, we propose ALI-Agent, an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments. ALI-Agent operates through two principal…
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis
Methods: ALIGN
