TL;DR
DeEscalWild is a high-quality benchmark dataset of real-world police-civilian interactions for fine-tuning small language models (SLMs), aimed at making automated de-escalation training for law enforcement more scalable and realistic.
Contribution
The paper presents a novel dataset and benchmark for fine-tuning SLMs on police-civilian interactions, and shows that SLMs fine-tuned on it outperform their base models on de-escalation tasks.
Findings
SLMs fine-tuned on DeEscalWild outperform base models across multiple metrics.
Qwen 2.5 (3B-Instruct) surpasses Gemini 2.5 Flash in domain-specific evaluation.
The dataset enables development of low-latency, privacy-preserving officer training systems.
Abstract
Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from publicly available video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process combining human-in-the-loop verification with LLM-as-a-Judge evaluation to distill…
