Alignment midtraining for animals

Jasmine Brazilek; Miles Tidmarsh

arXiv:2604.13076·cs.CL·May 5, 2026



TL;DR

This paper studies alignment midtraining for animal compassion using a new evaluation, ANIMA. It shows that midtraining on targeted synthetic documents improves compassionate reasoning without harming safety benchmarks, but that the effect degrades under further instruction tuning.

Contribution

It introduces ANIMA, a novel dataset for evaluating animal compassion, and demonstrates the effectiveness and limitations of midtraining alignment approaches.

Findings

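1. Midtraining with 3000 documents achieves 77% on ANIMA, compared to 40% for instruction-tuning approaches.

2. The effect generalizes to human compassion, with no degradation on standard safety benchmarks or capabilities.

3. Subsequent unrelated instruction tuning erodes the effect; the advantage disappears after 5000 samples.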

Abstract

We investigate the robustness of value alignment via midtraining with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts. To evaluate compassionate reasoning, we develop and publicly release Animal Norms In Moral Assessment (ANIMA), a 26-question evaluation spanning 13 ethical dimensions, publicly available as a dataset and Inspect evaluation. On ANIMA, training with 3000 documents achieves 77% compared to 40% for instruction-tuning approaches, with generalization to human compassion and no degradation in standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning degrades the intervention, with the advantage disappearing after 5000 samples. Our exploratory results suggest document-based value interventions may require explicit preservation strategies to remain…
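As a rough illustration of how a headline percentage on a 26-question, 13-dimension evaluation can be read, the sketch below aggregates per-question scores into an overall mean and per-dimension means. It is a hypothetical sketch, not ANIMA's actual schema or its Inspect scorer: it assumes each question carries exactly one dimension label and a score in [0, 1], and the function name `summarize_scores`, the placeholder dimension labels, and the even two-questions-per-dimension split are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(results):
    """Aggregate (dimension, score) pairs into overall and per-dimension means.

    Hypothetical: assumes every question has one dimension label and a score in [0, 1].
    """
    by_dim = defaultdict(list)
    for dimension, score in results:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        by_dim[dimension].append(score)
    # Overall score is the unweighted mean over all questions.
    overall = mean(score for _, score in results)
    # Per-dimension score is the mean over that dimension's questions.
    per_dim = {dim: mean(scores) for dim, scores in by_dim.items()}
    return overall, per_dim


# Made-up data: 13 placeholder dimensions with 2 questions each (26 total).
example = []
for i in range(13):
    dim = f"dimension_{i + 1}"
    example.append((dim, 1.0))
    example.append((dim, 0.5 if i % 2 else 1.0))

overall, per_dim = summarize_scores(example)
print(f"overall: {overall:.0%}")
```

Averaging per question rather than per dimension is one design choice among several; if dimensions had unequal question counts, a per-dimension average would weight them differently.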


Code & Models

Datasets

sentientfutures/anima (dataset)
