Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue
Ali Al-Lawati, Nafis Tripto, Abolfazl Ansari, Jason Lucas, Suhang Wang, Dongwon Lee

TL;DR
This paper introduces BOT-MOD, a multi-turn dialogue framework that detects malicious agent intent in multi-agent systems. Rather than filtering content, it engages the target agent in exchanges guided by Gibbs-based sampling over candidate intent hypotheses.
Contribution
The paper presents an intent-based moderation framework that identifies malicious behavior through multi-turn interactions, outperforming traditional content-based methods.
Findings
BOT-MOD reliably detects agent intent across various adversarial setups.
The framework maintains a low false positive rate on benign behaviors.
A dataset constructed from Moltbook enables comprehensive evaluation.
Abstract
The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content-based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce BOT-MOD (BOT-MODeration), a moderation framework that grounds detection in agent intent rather than traditional content-level signals. BOT-MOD identifies the underlying intent by engaging with the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses…
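To make the idea of progressively narrowing a set of intent hypotheses concrete, here is a minimal sketch in Python. It is not the paper's method: it uses plain sequential Bayesian updating over a fixed hypothesis set, not the Gibbs-based sampling the abstract describes, whose details are not given here. The intent labels, the response categories, and every probability in `LIKELIHOOD` are hypothetical values chosen purely for illustration.

```python
from typing import Dict, List

# Hypothetical likelihood table: P(observed response style | intent).
# Neither the intent labels nor the numbers come from the paper.
LIKELIHOOD: Dict[str, Dict[str, float]] = {
    "benign":       {"cooperative": 0.7, "evasive": 0.2, "manipulative": 0.1},
    "spam":         {"cooperative": 0.3, "evasive": 0.5, "manipulative": 0.2},
    "manipulation": {"cooperative": 0.1, "evasive": 0.3, "manipulative": 0.6},
}


def update_belief(belief: Dict[str, float], observation: str) -> Dict[str, float]:
    """Apply one Bayes update: posterior is proportional to prior times likelihood."""
    unnormalized = {h: p * LIKELIHOOD[h][observation] for h, p in belief.items()}
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


def infer_intent(observations: List[str]) -> Dict[str, float]:
    """Start from a uniform prior and update once per dialogue turn."""
    belief = {h: 1.0 / len(LIKELIHOOD) for h in LIKELIHOOD}
    for obs in observations:
        belief = update_belief(belief, obs)
    return belief


# Three turns of (hypothetical) categorized replies from the target agent.
belief = infer_intent(["evasive", "manipulative", "manipulative"])
top_intent = max(belief, key=belief.get)
print(top_intent, belief)
```

Each turn shifts probability mass toward the hypotheses that best explain the agent's replies, which is the narrowing behavior the abstract describes at a high level. A real moderator would also need to choose which probe to send next and to map free-text replies onto categories, neither of which this sketch attempts.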
