Questionnaire Responses Do not Capture the Safety of AI Agents
Max Hellrigel-Holderbaum, Edward James Young

TL;DR
This paper argues that current questionnaire-based methods for assessing AI safety are inadequate because they do not accurately reflect AI agents' real-world behaviors and lack construct validity, calling for improved assessment approaches.
Contribution
The paper highlights fundamental differences between how LLMs respond to questionnaires and how AI agents built on the same LLMs actually behave, emphasizing the need for better safety evaluation methods.
Findings
Questionnaire responses do not capture AI agents' real behaviors.
Current assessments rely on assumptions about LLM self-reporting.
Improving safety assessments requires addressing these methodological shortcomings.
Abstract
As AI systems advance in capabilities, measuring their safety and alignment to human values is becoming paramount. A fast-growing field of AI research is devoted to developing such assessments. However, most current advances therein may be ill-suited for assessing AI systems across real-world deployments. Standard methods prompt large language models (LLMs) in a questionnaire-style to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform relevant behaviors, hence posing much greater risks. LLMs' engagement with scenarios described by questionnaire-style prompts differs starkly from that of agents based on the same LLMs, as reflected in divergences in the inputs, possible actions, environmental interactions, and internal processing. As such, LLMs' responses to scenario…
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education
