MASteer: Multi-Agent Adaptive Steer Strategy for End-to-End LLM Trustworthiness Repair
Changqing Li, Tianlin Li, Xiaohan Zhang, Aishan Liu, Li Pan

TL;DR
MASteer is an end-to-end framework that combines multi-agent sample generation with adaptive representation engineering to repair trustworthiness issues in large language models automatically and efficiently.
Contribution
MASteer is presented as the first framework to automate trustworthiness repair in LLMs, using adaptive, context-aware representation strategies together with multi-agent sample generation.
Findings
Outperforms baselines on trustworthiness metrics.
Improves trustworthiness metrics on LLaMA-3.1-8B-Chat by 15.36%.
Improves trustworthiness metrics on Qwen-3-8B-Chat by 4.21%.
Abstract
Large Language Models (LLMs) face persistent and evolving trustworthiness issues, motivating developers to seek automated and flexible repair methods that enable convenient deployment across diverse scenarios. Existing repair methods like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) are costly and slow, while prompt engineering lacks robustness and scalability. Representation engineering, which steers model behavior by injecting targeted concept vectors during inference, offers a lightweight, training-free alternative. However, current approaches depend on manually crafted samples and fixed steering strategies, limiting automation and adaptability. To overcome these challenges, we propose MASteer, the first end-to-end framework for trustworthiness repair in LLMs based on representation engineering. MASteer integrates two core components: AutoTester,…
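The abstract describes representation engineering as injecting a targeted concept vector into the model at inference time. The sketch below illustrates that general idea on toy NumPy arrays. It is not MASteer's implementation: the difference-of-means construction is one common way to derive a steering direction and is assumed here for illustration, and the names `concept_vector`, `steer`, and `alpha` are hypothetical.

```python
import numpy as np


def concept_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit-length difference-of-means direction (an assumed, common choice)."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def steer(hidden: np.ndarray, vector: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * vector to every hidden state (broadcast over rows)."""
    return hidden + alpha * vector


rng = np.random.default_rng(0)
dim = 8

# Toy stand-ins for layer activations: "desired" samples are shifted along axis 0.
shift = np.zeros(dim)
shift[0] = 2.0
pos_acts = rng.normal(size=(32, dim)) + shift
neg_acts = rng.normal(size=(32, dim))

v = concept_vector(pos_acts, neg_acts)

# Hidden states for four tokens at inference time, steered with strength 3.0.
hidden = rng.normal(size=(4, dim))
steered = steer(hidden, v, alpha=3.0)

# Because v has unit length, each state's projection onto v grows by alpha.
projection_gain = (steered - hidden) @ v
```

In a real model the addition would be applied to a chosen layer's hidden states during the forward pass, and the strength `alpha` would need tuning. A fixed `alpha` corresponds to the fixed steering strategies that the abstract identifies as a limitation.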
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Mobile Crowdsensing and Crowdsourcing · Adversarial Robustness in Machine Learning
