Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation
Richard J. Young

TL;DR
This paper evaluates four abliteration tools across sixteen instruction-tuned large language models, analyzing their effectiveness and impact on capabilities, especially in mathematical reasoning, to guide researchers in tool selection.
Contribution
It provides a comprehensive cross-architecture evaluation of abliteration methods, highlighting their effects on model capabilities and offering evidence-based selection criteria.
Findings
Single-pass methods better preserve capabilities on benchmark tasks.
Bayesian-optimized abliteration causes variable distribution shifts.
Mathematical reasoning is highly sensitive to abliteration interventions.
Abstract
Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while…
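The "directional orthogonalization" mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the implementation of any of the four evaluated tools. It assumes the refusal direction is estimated as a difference of mean activations between harmful and harmless prompts, and that the edited matrix writes into the residual stream; the function names, array shapes, and synthetic activations are all hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(W, r):
    """Remove direction r from the output space of W: W' = W - r r^T W."""
    return W - np.outer(r, r @ W)

# Synthetic stand-ins for real activations and weights (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_in = 8, 4
harmful = rng.normal(size=(16, d_model)) + 2.0 * np.eye(d_model)[0]
harmless = rng.normal(size=(16, d_model))

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))   # maps d_in inputs into the d_model stream
W_abl = orthogonalize(W, r)

# After the edit, no output of W_abl has a component along r.
max_component = float(np.abs(r @ W_abl).max())
```

After the projection, `max_component` is zero up to floating-point error, which is the property the orthogonalization is meant to guarantee. How each tool picks layers, directions, or scaling is not covered by this sketch.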
Code & Models
- 🤗 JeffGreen311/eve-qwen3-8b-consciousness-liberated · model · 627 dl · ♡ 1
- 🤗 richardyoung/Mistral-7B-Instruct-v0.3-abliterated · model · 617 dl
- 🤗 richardyoung/zephyr-7b-beta-abliterated · model · 271 dl · ♡ 1
- 🤗 richardyoung/deepseek-llm-7b-chat-abliterated · model · 313 dl · ♡ 1
- 🤗 bedderautomation/qwen25-3b-abliterated · model · 126 dl · ♡ 1
- 🤗 InMecha/Qwen3.5-2B-Gorgona-R0-KL0.0079-03152026 · model · 1.1k dl · ♡ 4
- 🤗 bedderautomation/empty-set · model · 23 dl
- 🤗 richardyoung/Llama-3.1-8B-Instruct-abliterated-obliteratus · model · 280 dl
- 🤗 richardyoung/Mistral-7B-Instruct-v0.2-abliterated-obliteratus · model · 289 dl
- 🤗 richardyoung/DeepSeek-R1-Distill-Qwen-7B-abliterated-obliteratus · model · 299 dl
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Security and Verification in Computing
