AdversariaLLM: A Unified and Modular Toolbox for LLM Robustness Research

Tim Beyer; Jonas Dornbusch; Jakob Steimle; Moritz Ladenburger; Leo Schwinn; Stephan Günnemann

arXiv:2511.04316·cs.AI·November 7, 2025

Open Access

TL;DR

AdversariaLLM is a comprehensive, modular toolbox designed to improve reproducibility, correctness, and comparability in LLM robustness research by integrating multiple attack algorithms, datasets, and evaluation tools.

Contribution

The paper introduces a unified framework for LLM robustness testing that consolidates attack methods, datasets, and evaluation metrics, addressing the fragmentation and reproducibility problems of existing implementations.

Findings

01

Implements 12 adversarial attack algorithms

02

Integrates 7 benchmark datasets spanning harmfulness, over-refusal, and utility evaluation

03

Provides reproducibility features such as compute-resource tracking and deterministic results

Abstract

The rapid expansion of research on Large Language Model (LLM) safety and robustness has produced a fragmented and oftentimes buggy ecosystem of implementations, datasets, and evaluation methods. This fragmentation makes reproducibility and comparability across studies challenging, hindering meaningful progress. To address these issues, we introduce AdversariaLLM, a toolbox for conducting LLM jailbreak robustness research. Its design centers on reproducibility, correctness, and extensibility. The framework implements twelve adversarial attack algorithms, integrates seven benchmark datasets spanning harmfulness, over-refusal, and utility evaluation, and provides access to a wide range of open-weight LLMs via Hugging Face. The implementation includes advanced features for comparability and reproducibility such as compute-resource tracking, deterministic results, and distributional…
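To make "compute-resource tracking" and "deterministic results" concrete, the following is a minimal, standard-library-only sketch of the general idea. It is not AdversariaLLM's API: `RunRecord` and `toy_attack` are hypothetical names invented for illustration, and the "attack" is a placeholder loop that samples integers instead of querying a model.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical record of one run: its output plus the compute it consumed."""
    seed: int
    result: list
    model_queries: int
    wall_seconds: float


def toy_attack(seed: int, n_steps: int = 5) -> RunRecord:
    """Stand-in for a randomized attack loop (assumption, not the toolbox's code)."""
    rng = random.Random(seed)  # local RNG, so the run depends only on the seed
    start = time.perf_counter()
    queries = 0
    candidates = []
    for _ in range(n_steps):
        candidates.append(rng.randrange(1000))  # placeholder for a sampled candidate
        queries += 1  # count every simulated model call
    elapsed = time.perf_counter() - start
    return RunRecord(seed, candidates, queries, elapsed)


first = toy_attack(seed=0)
second = toy_attack(seed=0)
print(first.result == second.result)  # same seed, same candidates
print(first.model_queries)            # number of simulated model calls
```

Reporting a query count alongside wall-clock time is one plausible way to compare attacks at equal compute budgets; which quantities the toolbox itself records is not specified in the abstract above.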

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education