SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing
Anjali Parashar, Yingke Li, Eric Yang Yu, Fei Chen, James Neidhoefer, Devesh Upadhyay, Chuchu Fan

TL;DR
SEED-SET is a Bayesian framework for benchmarking the ethical alignment of autonomous systems. It combines objective metrics with stakeholder preferences to efficiently identify high-quality test cases in complex environments.
Contribution
It introduces a hierarchical Gaussian Process model and a novel acquisition strategy for ethical testing, addressing the lack of evaluation metrics and stakeholder subjectivity.
Findings
Outperforms baseline methods in test candidate quality.
Generates up to 2x more optimal test cases.
Improves coverage of high-dimensional search spaces by 1.25x.
Abstract
As autonomous systems such as drones become increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate their ethical alignment, since failure to do so poses imminent danger to human lives and introduces long-term bias in decision-making. Automated ethical benchmarking of these systems is understudied due to the lack of ubiquitous, well-defined evaluation metrics and to stakeholder-specific subjectivity, which cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates domain-specific objective evaluations and subjective value judgments from stakeholders. SEED-SET models both evaluation types separately with hierarchical Gaussian Processes and uses a novel acquisition strategy to propose interesting test candidates based on learnt qualitative preferences and objectives that align with…
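The hierarchical modeling idea in the abstract (an objective model from scenarios to measurable outcomes, then a subjective model from outcomes to a stakeholder score) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy data, the `ExactGP` class, the RBF kernel, and the use of exact GP regression on scalar scores. The paper itself is described as using variational Gaussian Processes, so this is not the authors' implementation.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between point sets a (n, d) and b (m, d)."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

class ExactGP:
    """Zero-mean exact GP regression with a scalar output (illustrative only)."""
    def __init__(self, X, t, lengthscale=1.0, noise=1e-4):
        self.X, self.ls = X, lengthscale
        K = rbf(X, X, lengthscale) + noise * np.eye(len(X))
        self.L = np.linalg.cholesky(K)
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, t))

    def predict(self, Xs):
        """Return the posterior mean and variance of the latent function at Xs."""
        Ks = rbf(Xs, self.X, self.ls)
        mean = Ks @ self.alpha
        v = np.linalg.solve(self.L, Ks.T)
        var = 1.0 - np.sum(v**2, axis=0)  # prior variance is k(x, x) = 1
        return mean, np.maximum(var, 0.0)

# Hypothetical toy data: scenarios x, objective metrics y, stakeholder score z.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 2))
Y = np.column_stack([np.sin(3 * X[:, 0]), X[:, 1] ** 2])
Z = 0.7 * Y[:, 0] - 0.3 * Y[:, 1]

# Level 1: one GP per objective metric (x -> y). Level 2: one GP on outcomes (y -> z).
objective_gps = [ExactGP(X, Y[:, j], lengthscale=0.5) for j in range(Y.shape[1])]
subjective_gp = ExactGP(Y, Z, lengthscale=1.0)

def predict_z(Xs):
    """Chain the two levels: predict outcomes, then score the predicted outcomes."""
    y_mean = np.column_stack([gp.predict(Xs)[0] for gp in objective_gps])
    z_mean, z_var = subjective_gp.predict(y_mean)
    return y_mean, z_mean, z_var
```

Note that this sketch passes only the mean of the objective prediction to the second level, so `z_var` ignores uncertainty in y. A fuller treatment would propagate that uncertainty as well.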
Peer Reviews
Decision·ICLR 2026 Poster
1. Joint acquisition works. Ablations show the full objective (MI + preference) beats MI-only and Pref-only, especially in higher-D Fire-Rescue. 2. Good visualizations and discussions using practical case-studies.
1. The authors discuss the difficulty of measuring ethical behaviour, especially the subjectivity of ethics, but the LLM is then explicitly told which objective is primary and which is secondary in the subjective evaluation. Hence the underlying preference function is already defined and can be expressed directly as a simple linear combination of the objectives. There is nothing genuinely subjective or hidden for the GP to learn. 2. Narrow ablation coverage. Lacks robustness to LLM model/prompt/temperature chan
S1. Interesting and relevant problem for the community. S2. Well-founded and technically sound solution. The proposed framework builds on solid Bayesian experimental design principles and uses hierarchical variational Gaussian processes to decompose measurable system outcomes and ethical judgments. The mathematical formulation is sound, and the probabilistic modeling choices are well justified. S3. Solid and well-executed experimental evaluation.
W1. The conceptual formulation of ethical alignment feels oversimplified and somewhat inconsistent with prior literature.
- The paper models ethical evaluation as a single real-valued score, whereas ethical considerations are often multi-dimensional or context-dependent, similar in structure to the scenario and observation spaces.
- In much of the ethical alignment literature, alignment is defined relationally, i.e., as the consistency between an agent's maximizing reward and an ethical rankin
1. The paper grounds subjective ethical criteria on observable outcomes through a hierarchical formulation (Objective GP $x \rightarrow y$; Subjective GP $y \rightarrow z$), rather than an end-to-end $x \rightarrow z$ mapping, which improves interpretability and sample efficiency. 2. Equation (2) formalizes the evaluation objective as a joint combination of objective information, subjective information, and a preference term, enabling an exploration-exploitation trade-off within the BED framewor
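To make the reviewer's description of a joint objective more concrete, here is a hedged sketch of an acquisition score that adds an objective information term, a subjective information term, and a preference term. It is not the paper's Equation (2). The function names, the `pref_weight` and `noise_var` parameters, and the candidate values are all hypothetical. The information terms use the standard Gaussian result that one noisy observation of a latent value with variance $\sigma^2$ under noise variance $\sigma_n^2$ carries $\tfrac{1}{2}\log(1 + \sigma^2/\sigma_n^2)$ nats, and summing across objective outputs assumes they are independent.

```python
import numpy as np

def gaussian_info_gain(latent_var, noise_var):
    """Mutual information (nats) between a Gaussian latent value and one noisy observation."""
    return 0.5 * np.log1p(latent_var / noise_var)

def joint_acquisition(y_var, z_var, z_mean, noise_var=1e-2, pref_weight=1.0):
    """Score candidates as objective info + subjective info + weighted preference.

    y_var:  (n, m) predictive variances of m objective outputs (hypothetical inputs)
    z_var:  (n,)   predictive variance of the subjective score
    z_mean: (n,)   predictive mean of the subjective score (the preference term)
    """
    objective_info = gaussian_info_gain(y_var, noise_var).sum(axis=1)
    subjective_info = gaussian_info_gain(z_var, noise_var)
    return objective_info + subjective_info + pref_weight * z_mean

# Three hypothetical candidates: well-explored, uncertain, and well-explored but preferred.
y_var = np.array([[0.01, 0.01], [0.5, 0.5], [0.01, 0.01]])
z_var = np.array([0.01, 0.3, 0.01])
z_mean = np.array([0.1, 0.2, 0.9])

scores = joint_acquisition(y_var, z_var, z_mean)
best = int(np.argmax(scores))  # the uncertain candidate wins at pref_weight=1.0
```

Raising `pref_weight` shifts the choice toward the candidate with the highest predicted preference, which is the exploration-exploitation trade-off the review refers to.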
1. **Overly strong and weakly justified assumptions.** A1 fixes the policy $\pi$ and scenario space $\chi$ while assuming objective evaluations are accessible. A2 presumes users' preferences are truthful and stationary. In reality, preferences are uncertain and dynamic. These assumptions lack sufficient empirical or prior justification. Even if adopted for tractability, their rationality and limitations should be explicitly discussed, as they substantially affect the credibility of real-world et
Taxonomy
Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning · Gaussian Processes and Bayesian Inference
