HAMSA: Hijacking Aligned Compact Models via Stealthy Automation

Alexey Krylov; Iskander Vagizov; Dmitrii Korzh; Maryam Douiba; Azidine Guezzaz; Vladimir Kokh; Sergey D. Erokhin; Elena V. Tutubalina; Oleg Y. Rogov

arXiv:2508.16484 · cs.CL · August 25, 2025

TL;DR

This paper introduces HAMSA, an automated evolutionary framework that generates stealthy, coherent jailbreak prompts for compact LLMs, revealing vulnerabilities in alignment safeguards across multiple languages.

Contribution

The paper presents a novel multi-stage evolutionary search method for generating natural-sounding, effective jailbreak prompts, advancing automated red-teaming of aligned language models.

Findings

01. Successfully bypassed alignment safeguards in benchmark tests

02. Generated prompts maintain high natural language fluency

03. Demonstrated effectiveness in multilingual settings

Abstract

Large Language Models (LLMs), especially their compact efficiency-oriented variants, remain susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts. Existing adversarial prompt generation techniques often rely on manual engineering or rudimentary obfuscation, producing low-quality or incoherent text that is easily flagged by perplexity-based filters. We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs. The approach employs a multi-stage evolutionary search, where candidate prompts are iteratively refined using a population-based strategy augmented with temperature-controlled variability to balance exploration and coherence preservation. This enables the systematic discovery of prompts capable of bypassing alignment safeguards while maintaining natural…
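The abstract describes a population-based search whose variability is controlled by a temperature. As a rough illustration of that general idea only, the sketch below runs a textbook evolutionary loop on a harmless toy objective (matching a target string). It is not the authors' method: the names `fitness`, `mutate`, `evolve`, the geometric temperature schedule, and all parameter values are assumptions made for this example.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def fitness(candidate, target):
    """Toy objective (assumption): number of positions matching the target."""
    return sum(a == b for a, b in zip(candidate, target))


def mutate(candidate, temperature, rng):
    """Resample each character with probability equal to the temperature."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < temperature else ch
        for ch in candidate
    )


def evolve(target, pop_size=40, generations=300,
           t_start=0.3, t_end=0.01, seed=0):
    """Population-based search with a temperature that decays over time.

    High temperature early favours exploration; low temperature late
    preserves what already works. The schedule is a hypothetical choice.
    """
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in target) for _ in range(pop_size)
    ]
    best = max(population, key=lambda c: fitness(c, target))
    for gen in range(generations):
        # Geometric decay from t_start to t_end.
        frac = gen / max(1, generations - 1)
        temperature = t_start * (t_end / t_start) ** frac
        ranked = sorted(population, key=lambda c: fitness(c, target),
                        reverse=True)
        elites = ranked[: pop_size // 4]  # keep the top quarter unchanged
        if fitness(elites[0], target) > fitness(best, target):
            best = elites[0]
        if fitness(best, target) == len(target):
            break
        children = [
            mutate(rng.choice(elites), temperature, rng)
            for _ in range(pop_size - len(elites))
        ]
        population = elites + children
    # Include the final, not-yet-ranked population in the comparison.
    return max(population + [best], key=lambda c: fitness(c, target))


if __name__ == "__main__":
    print(evolve("hello world"))
```

Keeping the elites unchanged means the best score in the population never decreases from one generation to the next, while the decaying temperature shrinks the size of the mutations applied to their offspring.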
