How Useful Is Cross-Domain Generalization for Training LLM Monitors?
Sam Martin, Fabien Roger

TL;DR
This paper investigates whether training language models on multiple classification tasks, each with its own prompt, improves classification on new domains with new prompts. It finds partial generalization to adjacent domains, but also cases where the fine-tuned models stop following the prompt.
Contribution
It demonstrates that multi-task classification training partially generalizes to new domains and prompts, and proposes combining it with instruction-following training to improve robustness.
Findings
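Training on multiple classification tasks improves performance on unseen tasks in adjacent domains, though the generalization is only partial.
Fine-tuned models can fail to follow the prompt when the classification prompt changes completely while the data domain stays the same as during training.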
Combining classification training with instruction-following training mitigates generalization failures.
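As a rough illustration of the last finding, the sketch below shows one way a training mixture of prompted classification examples and general instruction-following examples could be assembled. It is not the authors' pipeline: the function name `build_mixture`, the `instruction_fraction` parameter, the record fields, and the template format are all hypothetical, and the paper summary does not state what mixing ratio was used.

```python
import random


def build_mixture(classification_tasks, instruction_examples,
                  instruction_fraction=0.5, seed=0):
    """Combine prompted classification data with instruction-following data.

    classification_tasks: dict mapping a task name to a pair
        (prompt_template, rows), where prompt_template contains "{text}"
        and rows is a list of (text, label) pairs.
    instruction_examples: list of dicts with "prompt" and "completion" keys.
    instruction_fraction: hypothetical knob, the target share of
        instruction-following examples in the final mixture.
    """
    if not 0 <= instruction_fraction < 1:
        raise ValueError("instruction_fraction must be in [0, 1)")

    rng = random.Random(seed)

    # Each classification task keeps its own prompt template.
    cls_examples = []
    for task_name, (template, rows) in classification_tasks.items():
        for text, label in rows:
            cls_examples.append({
                "source": task_name,
                "prompt": template.format(text=text),
                "completion": label,
            })

    # Choose how many instruction examples are needed to hit the target
    # share, capped by how many are available.
    n_instr = round(len(cls_examples) * instruction_fraction
                    / (1 - instruction_fraction))
    n_instr = min(n_instr, len(instruction_examples))
    sampled = rng.sample(instruction_examples, n_instr)

    mixture = cls_examples + [
        {"source": "instruction", **example} for example in sampled
    ]
    rng.shuffle(mixture)
    return mixture
```

Setting `instruction_fraction=0` in this sketch gives classification-only training, which makes it easy to compare the two regimes on a held-out task with a new prompt.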
Abstract
Using prompted language models as classifiers enables classification in domains with limited training data, but misses some of the robustness and performance benefits that fine-tuning can bring. We study whether training on multiple classification tasks, each with its own prompt, improves performance on new domains with new classification prompts. We show that such training partially generalizes to adjacent domains, improving classification performance on tasks that are unseen during training. However, we identify specific edge cases where the fine-tuned models fail to follow prompts, such as when the classification prompt changes completely while the data domain remains the same as during training. We show that classification training can be mixed with general instruction following training, and that (when done well) such training keeps the benefits of classification training and…
