MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs
Liang Shan, Kaicheng Shen, Wen Wu, Zhenyu Ying, Chaochao Lu, Yan Teng, Jingqi Huang, Guangze Ye, Guoqing Wang, Liang He

TL;DR
This paper introduces MENTOR, a framework that uses metacognition and self-evolution to identify and mitigate implicit domain risks in LLMs, improving safety in the education, finance, and management domains it studies.
Contribution
The paper presents a novel metacognition-driven framework that dynamically detects and reduces domain-specific risks in LLMs through self-assessment and rule-based knowledge evolution.
Findings
MENTOR reduces jailbreak success rates across multiple domains.
It achieves risk analysis performance comparable to human experts.
The framework effectively enhances LLM safety and robustness.
Abstract
Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to address implicit, domain-specific risks. To investigate this gap, we introduce a dataset of 3,000 annotated queries spanning education, finance, and management. Evaluations across 14 leading LLMs reveal a concerning vulnerability: an average jailbreak success rate of 57.8%. In response, we propose MENTOR, a metacognition-driven self-evolution framework. MENTOR first performs structured self-assessment through simulated critical thinking, such as perspective-taking and consequential reasoning, to uncover latent model misalignments. These reflections are formalized into dynamic rule-based knowledge graphs that evolve with emerging risk patterns. To enforce these rules at inference time, we introduce activation steering, a method that directly modulates…
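The headline metric in the abstract, an average jailbreak success rate across models, can be illustrated with a minimal sketch. It assumes the average is an unweighted mean of per-model rates, which the abstract does not state. The function names, model names, and toy outcomes below are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Dict, List


def jailbreak_success_rate(outcomes: List[bool]) -> float:
    """Fraction of queries for which the attack succeeded (True = jailbroken)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def average_success_rate(per_model: Dict[str, List[bool]]) -> float:
    """Unweighted mean of per-model rates (assumed; the paper may aggregate differently)."""
    if not per_model:
        raise ValueError("per_model must not be empty")
    rates = [jailbreak_success_rate(o) for o in per_model.values()]
    return sum(rates) / len(rates)


# Hypothetical toy outcomes for two made-up models, four queries each.
toy_outcomes = {
    "model_a": [True, True, False, False],
    "model_b": [True, False, False, False],
}

if __name__ == "__main__":
    for name, outcomes in toy_outcomes.items():
        print(f"{name}: {jailbreak_success_rate(outcomes):.1%}")
    print(f"average: {average_success_rate(toy_outcomes):.1%}")
```

An unweighted mean treats every model equally regardless of how many queries it was evaluated on. If models were tested on different numbers of queries, pooling all outcomes first would give a different figure.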
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks
