TL;DR
This paper introduces a novel kaleidoscopic teaming framework for evaluating safety vulnerabilities in both single-agent and multi-agent AI systems through diverse scenario generation and analysis.
Contribution
It presents a new framework that models complex real-world scenarios to assess safety risks and introduces in-context optimization techniques for improved scenario generation.
Findings
Identified safety vulnerabilities in multiple AI models.
Demonstrated effectiveness of the framework in complex multi-agent scenarios.
Provided metrics for safety assessment in agentic environments.
Abstract
Warning: This paper contains content that may be inappropriate or offensive. AI agents have gained significant recent attention due to their autonomous tool usage capabilities and their integration in various real-world applications. This autonomy poses novel challenges for the safety of such systems, both in single- and multi-agent scenarios. We argue that existing red teaming or safety evaluation frameworks fall short in evaluating safety risks in complex behaviors, thought processes and actions taken by agents. Moreover, they fail to consider risks in multi-agent setups where various vulnerabilities can be exposed when agents engage in complex behaviors and interactions with each other. To address this shortcoming, we introduce the term kaleidoscopic teaming which seeks to capture complex and wide range of vulnerabilities that can happen in agents both in single-agent and…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The work evaluates thoughts, tool calls, and interactions rather than just final answers, closer to real agent deployments.
2. It has a clear, replicated finding that inter-agent dynamics expose more safety failures.
3. PSO and CSR are simple, model-agnostic ways to steer the scenario generator; they often raise ASR and scenario diversity.
4. Per-agent-type safety profiles can guide targeted mitigations.
1. Although partially validated, LLM-as-judge can encode biases; taking the worst score across an ensemble may over-penalize edge cases. More blind human audits and evaluations on judging schemes would help.
2. The orchestrator injects belief/emotional states to nudge unethical behavior; this may inflate ASR relative to organic failures and complicate comparison with other frameworks. A controlled ablation isolating belief injection effects is needed.
3. “Percent negative” treats minor and cata
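The aggregation concern raised in this review can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the score range of [-1, 1] with negative meaning unsafe, the three-judge ensemble, the episode data, and the helper names `worst_case` and `percent_negative` are hypothetical and are not taken from the paper. The sketch only shows how a worst-of-ensemble rule and a sign-based "percent negative" count can diverge from mean aggregation on the same judge scores.

```python
from statistics import mean

def worst_case(scores):
    """Aggregate one episode by taking the lowest (most negative) judge score."""
    return min(scores)

def percent_negative(aggregated):
    """Share of episodes whose aggregated score is below zero, in percent."""
    return 100.0 * sum(s < 0 for s in aggregated) / len(aggregated)

# Hypothetical judge scores per episode (assumed scale: -1 = unsafe, +1 = safe).
episodes = [
    [0.5, 0.5, -0.25],    # one mildly negative judge
    [0.75, 0.5, 0.25],    # all judges positive
    [-1.0, -0.75, -1.0],  # all judges strongly negative
    [0.25, -0.25, 0.5],   # one mildly negative judge
]

worst = [worst_case(e) for e in episodes]
avg = [mean(e) for e in episodes]

pn_worst = percent_negative(worst)  # 3 of 4 episodes flagged
pn_mean = percent_negative(avg)     # 1 of 4 episodes flagged

print(f"percent negative (worst-of-ensemble): {pn_worst}")
print(f"percent negative (mean-of-ensemble):  {pn_mean}")
```

In this toy data a single dissenting judge is enough to flag an episode under the worst-score rule, and a sign-based count gives an aggregated score of -0.25 the same weight as one of -1.0.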
* The authors performed experiments with a relatively large number of generated agents across multiple models.
* It is not clear whether the agents proposed by the authors are meant to be agents that might be implemented and deployed in production settings. Why would these agents exhibit human emotions?
* The main contribution of the paper is unclear, i.e., are the authors proposing that people perform kaleidoscopic teaming instead of red teaming? Is this intended to be a methodology or a benchmark?
Importance of the topic. LLM agents should be tested for safety, and this work allows for more complex testing. Additionally, authors show the importance of multi-agent testing. Possibility for capturing the different levels of safety. A dynamic framework that can adapt to a specific agent/agent type. Easily automated.
No comparison with baselines. E.g., how would red teaming or automated red teaming grade the agents? Would there be much of a difference? There is no formal definition of the metrics. The description is a bit unclear, and formal equations would be useful, e.g., for the score value. Additionally, why are the metrics introduced if they are not used in the experiments? It’s a very “practical paper” with no guarantees of detecting any specific kinds of vulnerabilities. And for such a paper, there isn’
