MASteer: Multi-Agent Adaptive Steer Strategy for End-to-End LLM Trustworthiness Repair

Changqing Li; Tianlin Li; Xiaohan Zhang; Aishan Liu; Li Pan

arXiv:2508.06963 · cs.AI · August 12, 2025

TL;DR

MASteer is an end-to-end framework that uses multi-agent systems and adaptive representation engineering to improve the trustworthiness of large language models efficiently and automatically.

Contribution

MASteer is the first framework to automate trustworthiness repair in LLMs, combining multi-agent sample generation with adaptive, context-aware steering strategies based on representation engineering.

Findings

1. Outperforms baselines on trustworthiness metrics.

2. Improves trustworthiness metrics on LLaMA-3.1-8B-Chat by 15.36%.

3. Improves trustworthiness metrics on Qwen-3-8B-Chat by 4.21%.

Abstract

Large Language Models (LLMs) face persistent and evolving trustworthiness issues, motivating developers to seek automated and flexible repair methods that enable convenient deployment across diverse scenarios. Existing repair methods like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) are costly and slow, while prompt engineering lacks robustness and scalability. Representation engineering, which steers model behavior by injecting targeted concept vectors during inference, offers a lightweight, training-free alternative. However, current approaches depend on manually crafted samples and fixed steering strategies, limiting automation and adaptability. To overcome these challenges, we propose MASteer, the first end-to-end framework for trustworthiness repair in LLMs based on representation engineering. MASteer integrates two core components: AutoTester,…
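
As a rough illustration of what "injecting targeted concept vectors during inference" can mean, the sketch below builds a steering direction as the difference between mean activations of contrasting samples and adds a scaled copy of it to a hidden state. This is a generic difference-of-means sketch, not MASteer's own algorithm: the function names, the scaling factor alpha, and the toy 3-dimensional activations are all assumptions made for illustration, and a real setup would apply the shift to a transformer layer's hidden states.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Difference of means: points from undesired toward desired behavior.

    Hypothetical helper; one common way to derive a steering direction.
    """
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift one hidden state along the concept direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (made-up numbers, not taken from the paper).
truthful = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
untruthful = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = concept_vector(truthful, untruthful)
steered = steer([0.5, 0.5, 0.5], direction, alpha=0.5)

print(direction)  # [2.0, -2.0, 0.0]
print(steered)    # [1.5, -0.5, 0.5]
```

No weights are updated here, which is why the approach is described as training-free. The fixed alpha in this sketch corresponds to the kind of fixed steering strategy the abstract identifies as a limitation of current approaches.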

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Mobile Crowdsensing and Crowdsourcing · Adversarial Robustness in Machine Learning