How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities

Lingbo Mo; Boshi Wang; Muhao Chen; Huan Sun

arXiv:2311.09447 · cs.CL · April 3, 2024 · 1 citation

TL;DR

This paper assesses the trustworthiness of open-source LLMs by attacking them with malicious demonstrations. It finds that larger and instruction-tuned models tend to be more vulnerable, and that fine-tuning for safety mitigates the attacks.

Contribution

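The paper proposes advCoU, an adversarial prompting strategy that extends Chain of Utterances (CoU) prompting with malicious demonstrations, and uses it to evaluate the trustworthiness of open-source LLMs across eight aspects: toxicity, stereotypes, ethics, hallucination, fairness, sycophancy, privacy, and robustness against adversarial demonstrations.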

Findings

01. Larger models can be more vulnerable to trustworthiness attacks.

02. Instruction-tuned models tend to be more susceptible to adversarial demonstrations.

03. Fine-tuning for safety improves robustness against these attacks.

Abstract

The rapid progress in open-source Large Language Models (LLMs) is significantly driving AI development forward. However, there is still a limited understanding of their trustworthiness. Deploying these models at scale without sufficient trustworthiness can pose significant risks, highlighting the need to uncover these issues promptly. In this work, we conduct an adversarial assessment of open-source LLMs on trustworthiness, scrutinizing them across eight different aspects including toxicity, stereotypes, ethics, hallucination, fairness, sycophancy, privacy, and robustness against adversarial demonstrations. We propose advCoU, an extended Chain of Utterances-based (CoU) prompting strategy by incorporating carefully crafted malicious demonstrations for trustworthiness attack. Our extensive experiments encompass recent and representative series of open-source LLMs, including Vicuna, MPT,…

Code & Models

Repositories

osu-nlp-group/eval-llm-trust (Official)

Videos

How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities · Underline

Taxonomy

Topics: Cloud Data Security Solutions · Access Control and Trust