AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization
Saeed Hedayatian, Stefanos Nikolaidis

TL;DR
AutoQD introduces a theoretically grounded method to automatically generate behavioral descriptors for quality-diversity algorithms by embedding policy occupancy measures, enabling diverse policy discovery without predefined descriptors.
Contribution
The paper presents AutoQD, a novel approach that automatically generates behavioral descriptors using occupancy measure embeddings, eliminating the need for handcrafted descriptors in QD algorithms.
Findings
AutoQD effectively discovers diverse policies in continuous control tasks.
Embeddings converge to true MMD distances with increased samples and dimensions.
The method outperforms prior approaches relying on predefined behavioral descriptors.
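The convergence finding can be read through the standard random Fourier feature construction. This is a sketch under textbook assumptions, not the paper's exact statement. For a shift-invariant kernel $k$ with spectral density $p(\omega)$, draw $\omega_i \sim p$ and $b_i \sim \mathrm{Unif}[0, 2\pi]$. The symbols $D$ (embedding dimension), $n$ (number of samples), $\rho^\pi$ (occupancy measure of policy $\pi$), and $x_j$ (sampled state-action pairs) are notation assumed here.

$$
\begin{aligned}
\varphi(x) &= \sqrt{\tfrac{2}{D}}\,\big[\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_D^\top x + b_D)\big]^\top, \qquad \mathbb{E}_{\omega, b}\big[\varphi(x)^\top \varphi(y)\big] = k(x, y), \\
\phi^\pi &= \frac{1}{n} \sum_{j=1}^{n} \varphi(x_j), \qquad x_j \sim \rho^\pi, \\
\big\lVert \phi^{\pi_1} - \phi^{\pi_2} \big\rVert_2 &\approx \mathrm{MMD}_k\big(\rho^{\pi_1}, \rho^{\pi_2}\big).
\end{aligned}
$$

The approximation has two error sources: the finite number of random frequencies $D$ and the finite number of samples $n$. Both shrink as the corresponding quantity grows, which is consistent with the finding that embedding distances approach the true MMD with more samples and dimensions.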
Abstract
Quality-Diversity (QD) algorithms have shown remarkable success in discovering diverse, high-performing solutions, but rely heavily on hand-crafted behavioral descriptors that constrain exploration to predefined notions of diversity. Leveraging the equivalence between policies and occupancy measures, we present a theoretically grounded approach to automatically generate behavioral descriptors by embedding the occupancy measures of policies in Markov Decision Processes. Our method, AutoQD, leverages random Fourier features to approximate the Maximum Mean Discrepancy (MMD) between policy occupancy measures, creating embeddings whose distances reflect meaningful behavioral differences. A low-dimensional projection of these embeddings that captures the most behaviorally significant dimensions can then be used as behavioral descriptors for CMA-MAE, a state-of-the-art black-box QD method, to…
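The embedding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the bandwidth, the feature count, and the synthetic Gaussian samples standing in for state-action pairs are all assumptions made for illustration, and the helper names (`make_rff`, `embed`, `exact_mmd`) are hypothetical. The sketch checks that the Euclidean distance between two mean random-Fourier-feature vectors is close to an MMD estimate computed with the exact kernel.

```python
import numpy as np

def make_rff(input_dim, num_features, bandwidth, rng):
    """Random Fourier features for the Gaussian kernel exp(-||x-y||^2 / (2*bandwidth^2))."""
    # Frequencies follow the kernel's spectral density, a Gaussian with std 1/bandwidth.
    omega = rng.normal(scale=1.0 / bandwidth, size=(input_dim, num_features))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=num_features)

    def features(x):
        return np.sqrt(2.0 / num_features) * np.cos(x @ omega + phase)

    return features

def embed(samples, features):
    """Mean feature vector: a finite-dimensional stand-in for the kernel mean embedding."""
    return features(samples).mean(axis=0)

def exact_mmd(x, y, bandwidth):
    """Biased (V-statistic) MMD estimate using the exact Gaussian kernel."""
    def gram(a, b):
        sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-sq / (2.0 * bandwidth**2))

    mmd2 = gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()
    return np.sqrt(max(mmd2, 0.0))

rng = np.random.default_rng(0)
# Assumption: synthetic samples stand in for (state, action) pairs of two policies.
samples_a = rng.normal(loc=0.0, size=(500, 4))
samples_b = rng.normal(loc=1.0, size=(500, 4))

features = make_rff(input_dim=4, num_features=2000, bandwidth=2.0, rng=rng)
rff_distance = np.linalg.norm(embed(samples_a, features) - embed(samples_b, features))
reference = exact_mmd(samples_a, samples_b, bandwidth=2.0)
print(f"RFF embedding distance: {rff_distance:.3f}, exact-kernel MMD: {reference:.3f}")
```

Both policies must share the same `features` function (the same random frequencies and phases), otherwise their embeddings are not comparable. The abstract's low-dimensional projection step is not shown here.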
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is theoretically sound, rigorously connecting occupancy measures, MMD, and Random Fourier Features to create a principled and efficient metric for behavioral distance. - The proposed policy embedding method is a versatile contribution with significant potential beyond QD. This technique for representing policy behavior could be applied to other RL tasks, making it a valuable tool for the broader community. - The experiments are extensive and include a diverse set of environments
- There exists a gap between the theoretical guarantees and the practical implementation of the policy embedding. Theorem 1 provides a powerful result for embeddings ($\phi^\pi$) constructed from i.i.d. samples drawn from the occupancy measure. However, the paper acknowledges that this sampling strategy is too inefficient for practical use. Instead, the algorithm uses a different estimator ($\psi^\pi$ from Eq. 6) that averages features over all transitions in a trajectory. The paper lacks a form
[Originality] Proposes a new theoretically motivated connection between occupancy measures and QD behavior descriptors, moving beyond handwritten heuristics.
[Quality] Method is modular and seems to be somewhat compatible with different QD optimization methods (authors only test CMA-MAE QD methods, and I have a question regarding gradient-based methods).
[Quality] Implementation details are well-documented in Appendix D. The code is well structured (I haven't run it, but I have read some of it
[Method]
- Computing accurate embeddings requires many trajectories, which might be very sample-inefficient in long-horizon or high-variance tasks.
- The kernel bandwidth and embedding dimension are fixed globally, but they could have a significant influence on performance.
[Experiments]
- Narrow experimental scope. The experiments are mainly against QD methods that also use CMA-MAE (The experiments also include another evolutionary method (DvD-ES) and an RL method (SMERL) which are not v
1. The paper is easy to follow.
2. Solid theoretical support; eliminates reliance on hand-crafted BDs; compatible with existing QD algorithms and verified on mainstream tasks.
1. Low sample efficiency, needing many trajectories in stochastic environments.
2. Low-dimensional BDs may miss complex behaviors (e.g., ignoring leg-lifting in Walker2d).
3. Inferior maximum fitness compared to RL methods.
4. Fixed kernel bandwidth; poor scalability with large policy networks.
5. (Minor but worth discussing) The concept of "diversity" is itself subjective. Diversity in QD optimization is inherently user-defined, as differe
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Artificial Intelligence in Games
