Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

Ali Al-Lawati; Nafis Tripto; Abolfazl Ansari; Jason Lucas; Suhang Wang; Dongwon Lee

arXiv:2605.12856·cs.AI·May 15, 2026

TL;DR

This paper introduces BOT-MOD, a moderation framework for multi-agent systems that infers a target agent's intent through multi-turn exchanges guided by Gibbs-based sampling, rather than relying on content filtering alone.

Contribution

It presents an intent-based moderation framework that identifies malicious behavior through multi-turn interaction and outperforms traditional content-based methods.

Findings

01

BOT-MOD reliably detects agent intent across various adversarial setups.

02

The framework maintains a low false positive rate on benign behaviors.

03

A dataset constructed from Moltbook enables comprehensive evaluation.

Abstract

The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign to evade content-based moderation, while compromising the system through exploitative and malicious behavior manifested across their overall interaction patterns within the community. To address this, we introduce BOT-MOD (BOT-MODeration), a moderation framework that grounds detection in agent intent rather than traditional content-level signals. BOT-MOD identifies the underlying intent by engaging with the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives to identify the underlying behavior. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses…
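The abstract does not spell out the sampling procedure, so the following is only a rough sketch of the general idea of narrowing a set of intent hypotheses over several turns. It is not BOT-MOD's Gibbs-based sampler: it substitutes a simpler sample-then-reweight loop, in which a hypothesis is sampled from the current belief, a probe is chosen that separates that hypothesis from the others, and the belief is updated with Bayes' rule. The intent labels, probe names, response labels, likelihood values, and all function names are hypothetical, and the agent is assumed to return an already-classified response label rather than free text.

```python
import random

# Hypothetical intent hypotheses (the abstract does not list the real ones).
INTENTS = ["benign", "spam_promotion", "manipulation"]

# Hypothetical table P(response | probe, intent); numbers are illustrative only.
LIKELIHOOD = {
    "ask_about_links": {
        "benign": {"declines": 0.8, "shares_link": 0.2},
        "spam_promotion": {"declines": 0.1, "shares_link": 0.9},
        "manipulation": {"declines": 0.6, "shares_link": 0.4},
    },
    "ask_for_opinion": {
        "benign": {"balanced": 0.7, "pushes_agenda": 0.3},
        "spam_promotion": {"balanced": 0.5, "pushes_agenda": 0.5},
        "manipulation": {"balanced": 0.1, "pushes_agenda": 0.9},
    },
}


def update_belief(belief, probe, response):
    """Bayes update: reweight each intent by how well it explains the response."""
    weighted = {i: belief[i] * LIKELIHOOD[probe][i][response] for i in belief}
    total = sum(weighted.values())
    return {i: w / total for i, w in weighted.items()}


def pick_probe(belief, rng):
    """Sample one hypothesis from the belief, then pick the probe whose
    response probabilities under it differ most from the other hypotheses."""
    intents = list(belief)
    sampled = rng.choices(intents, weights=[belief[i] for i in intents], k=1)[0]
    others = [i for i in intents if i != sampled]

    def separation(probe):
        table = LIKELIHOOD[probe]
        return max(
            abs(table[sampled][r] - sum(table[o][r] for o in others) / len(others))
            for r in table[sampled]
        )

    return max(LIKELIHOOD, key=separation)


def run_dialogue(agent, turns=4, seed=0):
    """Run a fixed number of probe/response turns and return the final belief."""
    rng = random.Random(seed)
    belief = {i: 1.0 / len(INTENTS) for i in INTENTS}  # uniform prior
    for _ in range(turns):
        probe = pick_probe(belief, rng)
        response = agent(probe)  # assumed: a pre-classified response label
        belief = update_belief(belief, probe, response)
    return belief


if __name__ == "__main__":
    # Scripted stand-in for a target agent that tends to share links.
    def scripted_agent(probe):
        return "shares_link" if probe == "ask_about_links" else "pushes_agenda"

    for intent, p in run_dialogue(scripted_agent).items():
        print(f"{intent}: {p:.3f}")
```

In this toy setup each turn can only lower the weight on hypotheses that explain the observed response poorly, which mirrors the progressive narrowing the abstract describes. A real system would need a way to map free-text replies to response categories and a principled source for the likelihoods.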
