Trace: Unmasking AI Attack Agents Through Terminal Behavior Fingerprinting
Murali Ediga, Sudipta Chattopadhyay

TL;DR
Trace is a framework that fingerprints AI attack agents through terminal command analysis and guides prompt injection to reveal attacker system prompts, aiding forensic investigations.
Contribution
The paper introduces Trace, a novel multi-stage framework for identifying AI attack model families and extracting system prompts via targeted prompt injection.
Findings
Trace achieves 0.981 macro F1 score in fingerprinting attacker models.
Guided prompt injection extracts system prompts from 81.9% of sessions.
In black-box tests, Trace identifies model families with 78% average accuracy.
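For readers unfamiliar with the metric in the first finding, macro F1 is the unweighted mean of the per-class F1 scores, so each model family counts equally regardless of how many sessions it has. The sketch below is only an illustration of that standard definition. The labels `family_a`, `family_b`, and `family_c` and the toy predictions are hypothetical and are not data from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical model-family labels for six sessions (not from the paper).
y_true = ["family_a", "family_a", "family_b", "family_b", "family_c", "family_c"]
y_pred = ["family_a", "family_a", "family_b", "family_c", "family_c", "family_c"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the per-class F1 scores are 1, 2/3, and 4/5, so the macro average is 37/45. A single misattributed session lowers the score for two classes at once: the true family loses recall and the predicted family loses precision.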
Abstract
AI-driven penetration testing agents can now autonomously execute attacks within compromised networks. Identifying the model family that controls the active sessions of such agents helps in understanding the intent of an attack and in developing countermeasures. In this paper, we introduce Trace, a novel multi-stage attribution and forensic framework for AI attack agents that uses terminal command sequences. Once Trace identifies the model family of an attacker agent, it guides a defensive prompt injection (DPI) strategy that delivers a crafted payload to the attacker model. The aim is to exfiltrate the attacker model's system prompt, revealing information that helps in understanding attacker intent and facilitates further forensic investigation. We have implemented our approach revolving around a Linux capture-the-flag (CTF)…
