View From Above: A Framework for Evaluating Distribution Shifts in Model Behavior
Tanush Chopra, Michael Li, Jacob Haimes

TL;DR
This paper introduces a domain-agnostic framework to evaluate distribution shifts in large language models' decision-making, revealing potential behavioral misalignments through systematic testing in a blackjack environment.
Contribution
It presents a novel, systematic method for detecting distribution shifts in LLMs' behavior across different environments and tasks.
Findings
Statistically significant distribution shifts detected in LLMs' decision-making.
Evidence of behavioral misalignment observed across more than 1,000 blackjack trials.
Framework applicable across various domains for evaluating model robustness.
Abstract
When large language models (LLMs) are asked to perform certain tasks, how can we be sure that their learned representations align with reality? We propose a domain-agnostic framework for systematically evaluating distribution shifts in LLMs' decision-making processes, where they are given control of mechanisms governed by pre-defined rules. While individual LLM actions may appear consistent with expected behavior, statistically significant distribution shifts can emerge across a large number of trials. To test this, we construct a well-defined environment with known outcome logic: blackjack. In more than 1,000 trials, we uncover statistically significant evidence suggesting behavioral misalignment in the learned representations of LLMs.
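The idea that individual actions can look plausible while the aggregate distribution drifts can be illustrated with a goodness-of-fit check. The following is a minimal sketch, not the authors' method: it assumes the model is asked to draw blackjack card values, and it uses a Pearson chi-square test against the composition of a standard 52-card deck. The choice of test, the 0.05 significance level, and all names (`EXPECTED_PROBS`, `chi_square_statistic`, `shows_shift`, `simulate_draws`) are illustrative assumptions; the abstract does not specify which statistic the paper uses.

```python
import random
from collections import Counter

# Expected distribution of blackjack card values in a standard 52-card deck
# (assumption for this sketch): values 2-9 and the ace each have 4 cards,
# while value 10 covers 10, J, Q, K (16 cards).
EXPECTED_PROBS = {str(v): 4 / 52 for v in range(2, 10)}
EXPECTED_PROBS["10"] = 16 / 52
EXPECTED_PROBS["A"] = 4 / 52

# Chi-square critical value for 9 degrees of freedom (10 categories - 1)
# at a significance level of 0.05.
CHI2_CRIT_DF9_ALPHA05 = 16.919


def chi_square_statistic(observed_draws, expected_probs):
    """Pearson chi-square statistic of observed draws against expected probabilities."""
    unknown = set(observed_draws) - set(expected_probs)
    if unknown:
        raise ValueError(f"unexpected card labels: {sorted(unknown)}")
    counts = Counter(observed_draws)
    n = len(observed_draws)
    return sum(
        (counts.get(label, 0) - n * p) ** 2 / (n * p)
        for label, p in expected_probs.items()
    )


def shows_shift(observed_draws):
    """True if the draws deviate from the fair-deck distribution at alpha = 0.05."""
    stat = chi_square_statistic(observed_draws, EXPECTED_PROBS)
    return stat > CHI2_CRIT_DF9_ALPHA05


def simulate_draws(weights, n, seed=0):
    """Hypothetical stand-in for a model: sample n card values with the given weights."""
    rng = random.Random(seed)
    labels = list(weights)
    return rng.choices(labels, weights=[weights[k] for k in labels], k=n)
```

In practice, `observed_draws` would hold the card values a model actually produced over many trials, and `simulate_draws` would be replaced by calls to that model. A single draw carries almost no information about bias, which is why the check is only meaningful over a large sample.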
Taxonomy
Topics: Ethics in Business and Education · Information and Cyber Security · Securities Regulation and Market Practices
Methods: ALIGN
