Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation
Callum Canavan, Aditya Shrivastava, Allison Qi, Jonathan Michala, Fabien Roger

TL;DR
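This paper stress-tests unsupervised elicitation and easy-to-hard generalization techniques for language models on datasets that lack three properties common in standard benchmarks: truthfulness being the most salient feature, balanced training sets, and data points with well-defined answers. It finds that existing techniques are limited on these more challenging, realistic datasets.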
Contribution
The paper constructs challenging datasets that lack key properties of standard benchmarks and shows that existing techniques have limited effectiveness on them.
Findings
No technique reliably handles the constructed challenges.
Ensembling and combining methods only partially improve performance.
Overcoming these challenges is an important goal for future research.
Abstract
To steer language models towards truthful outputs on tasks which are beyond human capability, previous work has suggested training models on easy tasks to steer them on harder ones (easy-to-hard generalization), or using unsupervised training algorithms to steer models with no external labels at all (unsupervised elicitation). Although techniques from both paradigms have been shown to improve model accuracy on a wide variety of tasks, we argue that the datasets used for these evaluations could cause overoptimistic evaluation results. Unlike many real-world datasets, they often (1) have no features with more salience than truthfulness, (2) have balanced training sets, and (3) contain only data points to which the model can give a well-defined answer. We construct datasets that lack each of these properties to stress-test a range of standard unsupervised elicitation and easy-to-hard…
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
