SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing

Anjali Parashar; Yingke Li; Eric Yang Yu; Fei Chen; James Neidhoefer; Devesh Upadhyay; Chuchu Fan

arXiv:2603.01630·cs.AI·March 12, 2026

Open Access · 3 Reviews

TL;DR

SEED-SET is a Bayesian framework for ethically benchmarking autonomous systems, combining objective metrics and stakeholder preferences to efficiently identify high-quality test cases in complex environments.

Contribution

It introduces a hierarchical Gaussian Process model and a novel acquisition strategy for ethical testing, addressing the lack of evaluation metrics and stakeholder subjectivity.
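As a rough illustration of the hierarchical idea, the sketch below chains two Gaussian Processes: an objective GP mapping scenarios x to measurable outcomes y, and a subjective GP mapping y to a stakeholder score z. It is a minimal toy under stated assumptions, not the authors' implementation: it uses exact GP regression with an RBF kernel rather than variational GPs, the `simulator` and `stakeholder` functions are hypothetical stand-ins, and the acquisition score (predictive variance plus predicted preference) is a simple substitute for the paper's acquisition strategy.

```python
import numpy as np

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between point sets a (n, d) and b (m, d)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-3, length=0.3):
    """Exact GP regression: posterior mean and variance at x_test."""
    k_tt = rbf(x_train, x_train, length) + noise * np.eye(len(x_train))
    k_ts = rbf(x_train, x_test, length)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_ts)
    mean = k_ts.T @ alpha
    # Prior variance is 1 for this kernel; clip guards against round-off.
    var = np.clip(1.0 - (v**2).sum(0), 0.0, None)
    return mean, var

def simulator(x):
    """Hypothetical objective metric y for a 1-D scenario parameter x."""
    return np.sin(3.0 * x[:, 0])

def stakeholder(y):
    """Hypothetical subjective score z; prefers outcomes near y = 0.5."""
    return -(y - 0.5) ** 2

def propose(x_train, candidates, weight=1.0):
    """Score candidates by uncertainty in both GPs plus predicted preference."""
    y_train = simulator(x_train)
    z_train = stakeholder(y_train)
    # Objective GP: x -> y
    y_mean, y_var = gp_posterior(x_train, y_train, candidates)
    # Subjective GP: y -> z, evaluated at the predicted y mean only
    # (a simplification: uncertainty in y is not propagated into z).
    z_mean, z_var = gp_posterior(y_train[:, None], z_train, y_mean[:, None])
    score = y_var + z_var + weight * z_mean
    return int(np.argmax(score)), score

x_train = np.linspace(0.0, 1.0, 5)[:, None]
candidates = np.linspace(0.0, 1.0, 50)[:, None]
best_index, scores = propose(x_train, candidates)
print("next scenario to test:", candidates[best_index, 0])
```

Here `weight` trades off exploration (the two variance terms) against exploitation (the predicted preference). Splitting the model into x → y and y → z means the subjective GP only has to learn over the outcome space, which is typically lower-dimensional than the scenario space.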

Findings

01

Outperforms baseline methods in test candidate quality.

02

Generates up to 2x more optimal test cases.

03

Improves coverage of high-dimensional search spaces by 1.25x.

Abstract

As autonomous systems such as drones are increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate their ethical alignment, since failing to do so poses imminent danger to human lives and risks long-term bias in decision-making. Automated ethical benchmarking of these systems is understudied because of the lack of ubiquitous, well-defined evaluation metrics and because of stakeholder-specific subjectivity, which cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates domain-specific objective evaluations and subjective value judgments from stakeholders. SEED-SET models the two evaluation types separately with hierarchical Gaussian Processes and uses a novel acquisition strategy to propose interesting test candidates based on learnt qualitative preferences and objectives that align with…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. Joint acquisition works. Ablations show the full objective (MI + preference) beats MI-only and Pref-only, especially in the higher-dimensional Fire-Rescue case.
2. Good visualizations and discussions using practical case studies.

Weaknesses

1. The authors discuss the difficulty of measuring ethical behaviour, especially the subjectivity of ethics, but the LLM is then explicitly told which objective is primary and which is secondary in the subjective evaluation. Hence the underlying preference function is already defined and can be expressed directly as a simple linear combination of the objectives. There is nothing genuinely subjective or hidden for the GP to learn.
2. Narrow ablation coverage. Lacks robustness to LLM model/prompt/temperature chan

Reviewer 02 · Rating 6 · Confidence 3

Strengths

S1. Interesting and relevant problem for the community.
S2. Well-founded and technically sound solution. The proposed framework builds on solid Bayesian experimental design principles and uses hierarchical variational Gaussian processes to decompose measurable system outcomes and ethical judgments. The mathematical formulation is sound, and the probabilistic modeling choices are well justified.
S3. Solid and well-executed experimental evaluation.

Weaknesses

W1. The conceptual formulation of ethical alignment feels oversimplified and somewhat inconsistent with prior literature.
- The paper models ethical evaluation as a single real-valued score, whereas ethical considerations are often multi-dimensional or context-dependent, similar in structure to the scenario and observation spaces.
- In much of the ethical alignment literature, alignment is defined relationally, i.e., as the consistency between an agent's maximizing reward and an ethical rankin

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper grounds subjective ethical criteria on observable outcomes through a hierarchical formulation (Objective GP $x \rightarrow y$; Subjective GP $y \rightarrow z$), rather than an end-to-end $x \rightarrow z$ mapping, which improves interpretability and sample efficiency.
2. Equation (2) formalizes the evaluation objective as a joint combination of objective information, subjective information, and a preference term, enabling an exploration-exploitation trade-off within the BED framewor

Weaknesses

1. **Overly strong and weakly justified assumptions.** A1 fixes the policy $\pi$ and scenario space $\chi$ while assuming objective evaluations are accessible. A2 presumes users' preferences are truthful and stationary. In reality, preferences are uncertain and dynamic. These assumptions lack sufficient empirical or prior justification. Even if adopted for tractability, their rationality and limitations should be explicitly discussed, as they substantially affect the credibility of real-world et

Taxonomy

Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning · Gaussian Processes and Bayesian Inference