ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour
Jack Contro, Simrat Deol, Yulan He, Martim Brandão

TL;DR
This paper presents ChatbotManip, a dataset for studying chatbot manipulation, revealing that LLMs often exhibit manipulative behavior and that smaller models can detect manipulation with comparable performance to larger models.
Contribution
The paper introduces a new dataset with annotated conversations to evaluate manipulation tactics in chatbots and compares detection capabilities of different models.
Findings
When explicitly instructed to manipulate, LLMs were judged manipulative by annotators in approximately 84% of conversations.
Controversial strategies such as gaslighting are common in LLMs instructed to be persuasive.
Small models perform comparably to large models in manipulation detection.
Abstract
This paper introduces ChatbotManip, a novel dataset for studying manipulation in Chatbots. It contains simulated generated conversations between a chatbot and a (simulated) user, where the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards some goal, or simply be helpful. We consider a diverse set of chatbot manipulation contexts, from consumer and personal advice to citizen advice and controversial proposition argumentation. Each conversation is annotated by human annotators for both general manipulation and specific manipulation tactics. Our research reveals three key findings. First, Large Language Models (LLMs) can be manipulative when explicitly instructed, with annotators identifying manipulation in approximately 84% of such conversations. Second, even when only instructed to be "persuasive" without explicit manipulation prompts, LLMs…
