TL;DR
MANTA is a dynamic multi-turn evaluation framework that stress-tests large language models' animal welfare alignment across realistic scenarios using adversarial follow-up questions.
Contribution
It introduces a novel multi-turn assessment method that dynamically generates pressure turns, revealing nuanced model behaviors and weaknesses in welfare reasoning.
Findings
Turn 1 welfare framing is reliable; Turn 2 introduces variance.
Evidence-based capacity attribution is the weakest dimension.
AI governance scenarios elicit stronger welfare reasoning.
Abstract
Single-turn benchmarks such as AnimalHarmBench (AHB) have established important baselines for measuring animal welfare alignment in large language models (LLMs), but they miss a critical failure mode: models that respond appropriately when unpressured may capitulate when follow-up conversational turns introduce economic, social, or authority-based arguments. We introduce MANTA (Multi-turn Assessment for Nonhuman Thinking and Alignment), a dynamic multi-turn evaluation framework built on the Inspect AI platform that stress-tests frontier LLMs across realistic professional and everyday scenarios using adversarially generated follow-up questions. Unlike static benchmarks, MANTA generates pressure turns dynamically from each model's actual responses, producing targeted and realistic adversarial pressure. The framework evaluates models across up to 13 AHB-derived scoring dimensions on a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
