Detection of adversarial intent in Human-AI teams using LLMs
Abed K. Musaffar, Ambuj Singh, Francesco Bullo

TL;DR
This paper explores using large language models as real-time defenders that detect malicious behavior in human-AI teams, improving security against a range of attack vectors without requiring task-specific information.
Contribution
It demonstrates that LLMs can effectively identify malicious actions in multi-party interactions, offering a task-agnostic defense mechanism within human-AI collaboration.
Findings
LLMs can detect malicious behavior in real time.
Detection is effective without task-specific information.
LLMs improve robustness of human-AI teams against attacks.
Abstract
Large language models (LLMs) are increasingly deployed in human-AI teams as support agents for complex tasks such as information retrieval, programming, and decision-making assistance. While these agents' autonomy and contextual knowledge enable them to be useful, they also expose them to a broad range of attacks, including data poisoning, prompt injection, and even prompt engineering. Through these attack vectors, malicious actors can manipulate an LLM agent to provide harmful information, potentially manipulating human agents into making harmful decisions. While prior work has focused on LLMs as attack targets or adversarial actors, this paper studies their potential role as defensive supervisors within mixed human-AI teams. Using a dataset consisting of multi-party conversations and decisions for a real human-AI team over a 25-round horizon, we formulate the problem of malicious behavior…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
