DeepSight: An All-in-One LM Safety Toolkit

Bo Zhang; Jiaxuan Guo; Lijun Li; Dongrui Liu; Sujin Chen; Guanxu Chen; Zhijie Zheng; Qihao Lin; Lewen Yan; Chen Qian; Yijin Zhou; Yuyao Wu; Shaoxiong Guo; Tianyi Du; Jingyi Yang; Xuhao Hu; Ziqi Miao; Xiaoya Lu; Jing Shao; Xia Hu

arXiv:2602.12092·cs.CL·February 13, 2026

TL;DR

DeepSight introduces an integrated, open-source safety toolkit for large language models that unifies evaluation and diagnosis, providing transparent insights into internal mechanisms and supporting frontier AI risk assessment.

Contribution

It presents the first unified safety evaluation and diagnosis toolkit for LLMs, enabling transparent, scalable, and cost-effective safety analysis.

Findings

01. Unifies safety evaluation and diagnosis in one toolkit
02. Supports frontier AI risk evaluation
03. Transforms safety analysis from black-box to white-box insights

Abstract

As Large Models (LMs) develop rapidly, their safety has become a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks and cannot identify their internal root causes. Meanwhile, safety diagnosis often drifts away from concrete risk scenarios and stays at the level of explainability. As a result, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To address these issues systematically, we propose an open-source project, DeepSight, that puts a new integrated safety evaluation-diagnosis paradigm into practice. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)