A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
Wai Man Si, Mingjie Li, Michael Backes, Yang Zhang

TL;DR
This paper systematically evaluates training-free methods for improving trustworthiness in large language models, analyzing their effectiveness, trade-offs, and limitations across various settings and model types.
Contribution
It introduces a taxonomy of training-free methods based on their intervention points and provides a comprehensive analysis of their impacts on trustworthiness, utility, and robustness.
Findings
Training-free methods vary in effectiveness across trustworthiness dimensions.
Trade-offs exist between trustworthiness improvements and utility degradation.
The study identifies unresolved challenges and offers practical recommendations.
Abstract
As Large Language Models (LLMs) are deployed across a growing range of domains, their potential risks, including generating harmful or biased content, producing unsupported claims, and exhibiting vulnerabilities to adversarial attacks, have drawn significant attention. To enable quick and low-cost adaptation, training-free methods have recently emerged as cost-effective alternatives to post-training alignment techniques. Despite their promising results, these methods are evaluated inconsistently across the literature, cover limited dimensions of trustworthiness, and can introduce undesirable side effects, such as utility degradation and increased brittleness. To fully assess the impacts of these training-free methods, we take a step back and systematically re-evaluate the effectiveness of existing training-free methods against various trustworthiness settings and…
