Forecasting Rare Language Model Behaviors
Erik Jones, Meg Tong, Jesse Mu, Mohammed Mahfoud, Jan Leike, Roger Grosse, Jared Kaplan, William Fithian, Ethan Perez, Mrinank Sharma

TL;DR
This paper presents a forecasting method to predict rare but dangerous behaviors in language models at large deployment scales by analyzing query elicitation probabilities, enabling proactive risk mitigation.
Contribution
Introduces a novel forecasting approach that predicts rare model behaviors at scale based on elicitation probabilities, improving safety assessments before deployment.
Findings
Forecasts can predict dangerous behaviors across up to three orders of magnitude in query volume.
Elicitation probabilities scale predictably with the number of queries.
Method enables proactive identification of rare risks before large-scale deployment.
Abstract
Standard language model evaluations can fail to capture risks that emerge only at deployment scale. For example, a model may produce safe responses during a small-scale beta test, yet reveal dangerous information when processing billions of requests at deployment. To remedy this, we introduce a method to forecast potential risks across orders of magnitude more queries than we test during evaluation. We make forecasts by studying each query's elicitation probability -- the probability the query produces a target behavior -- and demonstrate that the largest observed elicitation probabilities predictably scale with the number of queries. We find that our forecasts can predict the emergence of diverse undesirable behaviors -- such as assisting users with dangerous chemical synthesis or taking power-seeking actions -- across up to three orders of magnitude of query volume. Our work enables…
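To make the idea of an elicitation probability concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's estimator: the function names are hypothetical, queries are assumed to be independent, and the straight-line fit in log-log space is a simplifying stand-in for whatever scaling model the authors actually use. The first function turns per-query elicitation probabilities into the chance that at least one query produces the target behavior; the second extrapolates the largest observed elicitation probability to a larger query count.

```python
import math


def prob_at_least_one(elicitation_probs):
    """Chance that at least one query elicits the behavior.

    Assumption (not from the source): queries are independent and
    every probability is strictly less than 1.
    """
    # Sum log(1 - p) for numerical stability when the p values are tiny.
    log_none = sum(math.log1p(-p) for p in elicitation_probs)
    return -math.expm1(log_none)  # equals 1 - prod(1 - p)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_max_prob(query_counts, max_probs, target_count):
    """Extrapolate the largest elicitation probability to target_count queries.

    Assumption (illustrative only): log10(max probability) is linear in
    log10(number of queries) over the range of interest.
    """
    xs = [math.log10(n) for n in query_counts]
    ys = [math.log10(p) for p in max_probs]
    slope, intercept = fit_line(xs, ys)
    log_p = slope * math.log10(target_count) + intercept
    return min(1.0, 10 ** log_p)  # a probability cannot exceed 1


if __name__ == "__main__":
    # Made-up numbers purely for illustration; they are not results from the paper.
    counts = [10, 100, 1000]
    observed_max = [1e-6, 1e-5, 1e-4]
    print(extrapolate_max_prob(counts, observed_max, 100_000))
    print(prob_at_least_one([1e-4] * 1000))
```

The point of the sketch is the workflow: measure the largest elicitation probabilities at small evaluation scales, fit a trend, and read off a forecast at a deployment-scale query count that was never tested directly.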
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
