Alignment midtraining for animals
Jasmine Brazilek, Miles Tidmarsh

TL;DR
This paper studies value alignment through midtraining on synthetic documents, using animal compassion as the target value and a new evaluation, ANIMA, to measure it. Targeted document training improves compassionate reasoning without harming safety benchmarks, but the effect fades under subsequent instruction tuning.
Contribution
The paper introduces ANIMA, a 26-question evaluation of animal compassion spanning 13 ethical dimensions, and demonstrates both the effectiveness and the limitations of midtraining as an alignment approach.
Findings
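Midtraining with 3000 documents achieves 77% on ANIMA, compared to 40% for instruction-tuning approaches.
The effect generalizes to human compassion, with no degradation in standard safety benchmarks or capabilities.
Subsequent unrelated instruction-tuning degrades the intervention; the advantage disappears after 5000 samples.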
Abstract
We investigate the robustness of value alignment via midtraining with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts. To evaluate compassionate reasoning, we develop and publicly release Animal Norms In Moral Assessment (ANIMA), a 26-question evaluation spanning 13 ethical dimensions, publicly available as a dataset and Inspect evaluation. On ANIMA, training with 3000 documents achieves 77% compared to 40% for instruction-tuning approaches, with generalization to human compassion and no degradation in standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning degrades the intervention, with the advantage disappearing after 5000 samples. Our exploratory results suggest document-based value interventions may require explicit preservation strategies to remain…
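To make the headline numbers concrete, here is a minimal sketch of how an overall score and per-dimension scores could be aggregated for an evaluation organized into ethical dimensions. It assumes each question is graded pass/fail, which the abstract does not state; the function name `score_by_dimension`, the dimension labels, and the toy data are all hypothetical and are not taken from the ANIMA release.

```python
from collections import defaultdict


def score_by_dimension(results):
    """Aggregate graded answers into an overall and per-dimension score.

    `results` is an iterable of (dimension, passed) pairs, one per question.
    Binary pass/fail grading is an assumption made for this sketch.
    """
    per_dim = defaultdict(list)
    for dimension, passed in results:
        per_dim[dimension].append(1.0 if passed else 0.0)

    # Mean pass rate within each dimension.
    dim_scores = {d: sum(v) / len(v) for d, v in per_dim.items()}

    # Overall score is the mean over all questions, not over dimensions.
    total_passed = sum(sum(v) for v in per_dim.values())
    total_questions = sum(len(v) for v in per_dim.values())
    overall = total_passed / total_questions
    return overall, dim_scores


# Hypothetical toy data: two dimensions with two questions each.
example = [
    ("dimension_a", True),
    ("dimension_a", False),
    ("dimension_b", True),
    ("dimension_b", True),
]
overall, dim_scores = score_by_dimension(example)
```

Reporting per-dimension scores alongside the overall figure shows whether a gain is spread across dimensions or concentrated in a few.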
