CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection

Henry Weld; Guanghao Huang; Jean Lee; Tongshu Zhang; Kunze Wang; Xinghong Guo; Siqu Long; Josiah Poon; Soyeon Caren Han

arXiv:2106.06213·cs.CL·July 26, 2021

TL;DR

CONDA introduces a large, context-aware dataset for in-game toxicity detection, enabling joint intent classification and slot filling, and provides a comprehensive analysis of toxicity patterns in Dota 2 chat logs.

Contribution

The paper presents a novel dataset and a dual semantic-level toxicity detection framework that incorporates context, intent, and slot analysis for in-game chat toxicity understanding.
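To make the dual-level idea concrete, here is a minimal sketch of what a dual-annotated record might look like: one intent label for the whole utterance plus one slot label per token. The class name `AnnotatedUtterance`, the function `joint_accuracy`, the toy utterances, and the label strings (`"toxic"`, `"other"`, `"slang"`, `"toxicity"`) are all illustrative assumptions, not the dataset's actual schema or label inventory. The joint score shown is the exact-match style of metric commonly used for joint intent classification and slot filling in NLU; the listing does not confirm that the paper defines its metric exactly this way.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedUtterance:
    """Hypothetical dual-annotated record (not the dataset's real schema)."""
    tokens: List[str]   # the utterance split into tokens
    intent: str         # utterance-level label
    slots: List[str]    # token-level labels, aligned one-to-one with tokens

    def __post_init__(self) -> None:
        # Token-level annotation only makes sense if every token has a label.
        if len(self.tokens) != len(self.slots):
            raise ValueError("each token needs exactly one slot label")


def joint_accuracy(gold: List[AnnotatedUtterance],
                   pred: List[AnnotatedUtterance]) -> float:
    """Fraction of utterances where the intent AND every slot label match."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    hits = sum(
        1 for g, p in zip(gold, pred)
        if g.intent == p.intent and g.slots == p.slots
    )
    return hits / len(gold)


# Toy, made-up examples; labels are placeholders for illustration only.
gold = [
    AnnotatedUtterance(["gg", "wp"], "other", ["slang", "slang"]),
    AnnotatedUtterance(["report", "this", "noob"], "toxic",
                       ["other", "other", "toxicity"]),
]
pred = [
    AnnotatedUtterance(["gg", "wp"], "other", ["slang", "slang"]),
    # Intent is right, but one slot is wrong, so the utterance counts as a miss.
    AnnotatedUtterance(["report", "this", "noob"], "toxic",
                       ["other", "other", "other"]),
]

score = joint_accuracy(gold, pred)
print(score)
```

The second prediction gets the utterance-level label right but misses a token-level label, so a joint metric penalizes it even though an utterance-only metric would not. That gap is the motivation for annotating at both levels.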

Findings

01. Strong NLU models achieve fine-grained toxicity detection results.

02. CONDA dataset covers diverse toxicity patterns and intent classes.

03. Enhanced understanding of in-game toxicity through context-aware analysis.

Abstract

Traditional toxicity detection models have focused on the single utterance level without deeper understanding of context. We introduce CONDA, a new dataset for in-game toxic language detection enabling joint intent classification and slot filling analysis, which is the core task of Natural Language Understanding (NLU). The dataset consists of 45K utterances from 12K conversations from the chat logs of 1.9K completed Dota 2 matches. We propose a robust dual semantic-level toxicity framework, which handles utterance and token-level patterns, and rich contextual chatting history. Accompanying the dataset is a thorough in-game toxicity analysis, which provides comprehensive understanding of context at utterance, token, and dual levels. Inspired by NLU, we also apply its metrics to the toxicity detection tasks for assessing toxicity and game-specific aspects. We evaluate strong NLU models on…

Code & Models

Datasets

Matrix430/CONDA
dataset · 66 downloads
