Federated Learning for Video Violence Detection: Complementary Roles of Lightweight CNNs and Vision-Language Models for Energy-Efficient Use

Sébastien Thuau; Siba Haidar; Rachid Chelouah

arXiv:2511.07171 · cs.CV · November 11, 2025

TL;DR

This paper compares federated learning strategies for video violence detection, highlighting the energy efficiency of lightweight CNNs versus the richer reasoning capabilities of vision-language models, and proposes hybrid deployment for practical use.

Contribution

The paper presents the first comparative simulation study of LoRA-tuned VLMs and personalized CNNs for federated violence detection, with explicit energy and CO2e analysis.

Findings

01

The 3D CNN achieves a ROC AUC of 92.59% at roughly half the energy cost of federated LoRA fine-tuning (240 Wh vs. 570 Wh).

02

Hierarchical category grouping improves VLM multiclass accuracy from 65.31% to 81%.

03

All methods exceed 90% accuracy in binary violence detection.
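
As a rough illustration of why hierarchical category grouping can raise measured multiclass accuracy, the sketch below scores the same predictions at the fine-grained level and at a grouped level. The label names and the `GROUPS` mapping are hypothetical placeholders, since this summary does not list the paper's categories, and the sketch does not reproduce the paper's grouping procedure (semantic similarity and class exclusion); it only shows the scoring effect of merging confusable classes.

```python
from typing import Dict, List

# Hypothetical fine-grained labels mapped to coarser groups.
# These names are assumptions for illustration, not the paper's taxonomy.
GROUPS: Dict[str, str] = {
    "fighting": "physical_violence",
    "assault": "physical_violence",
    "robbery": "property_crime",
    "vandalism": "property_crime",
    "normal": "non_violent",
}


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def grouped_accuracy(
    y_true: List[str], y_pred: List[str], groups: Dict[str, str]
) -> float:
    """Accuracy after mapping both label lists to their coarse groups."""
    return accuracy([groups[t] for t in y_true], [groups[p] for p in y_pred])


# Toy predictions: two errors, both between classes in the same group.
y_true = ["fighting", "assault", "robbery", "vandalism", "normal"]
y_pred = ["assault", "assault", "vandalism", "vandalism", "normal"]

fine = accuracy(y_true, y_pred)                    # 3 of 5 correct
coarse = grouped_accuracy(y_true, y_pred, GROUPS)  # 5 of 5 correct
print(f"fine-grained: {fine:.2f}, grouped: {coarse:.2f}")
```

In this toy case every error falls within a group, so grouping removes all of them; with real predictions the gain depends on how often confusions cross group boundaries.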

Abstract

Deep learning-based video surveillance increasingly demands privacy-preserving architectures with low computational and environmental overhead. Federated learning preserves privacy but deploying large vision-language models (VLMs) introduces major energy and sustainability challenges. We compare three strategies for federated violence detection under realistic non-IID splits on the RWF-2000 and RLVS datasets: zero-shot inference with pretrained VLMs, LoRA-based fine-tuning of LLaVA-NeXT-Video-7B, and personalized federated learning of a 65.8M-parameter 3D CNN. All methods exceed 90% accuracy in binary violence detection. The 3D CNN achieves superior calibration (ROC AUC 92.59%) at roughly half the energy cost (240 Wh vs. 570 Wh) of federated LoRA, while VLMs provide richer multimodal reasoning. Hierarchical category grouping (based on semantic similarity and class exclusion) boosts VLM…
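
The "roughly half" energy claim can be checked directly from the two figures quoted in the abstract. The sketch below computes the ratio and converts each figure to an emissions estimate. The 240 Wh and 570 Wh values come from the abstract; the grid carbon intensity is an assumed placeholder, not a value from the paper, and the function names are illustrative.

```python
CNN_ENERGY_WH = 240.0   # personalized federated 3D CNN (from the abstract)
LORA_ENERGY_WH = 570.0  # federated LoRA fine-tuning (from the abstract)

# Assumed grid carbon intensity in gCO2e per kWh. This is a placeholder
# chosen for illustration only; the paper's value is not given here.
ASSUMED_INTENSITY_G_PER_KWH = 400.0


def energy_ratio(a_wh: float, b_wh: float) -> float:
    """Return a_wh as a fraction of b_wh."""
    if b_wh <= 0:
        raise ValueError("reference energy must be positive")
    return a_wh / b_wh


def co2e_grams(energy_wh: float, intensity_g_per_kwh: float) -> float:
    """Convert energy in Wh to grams of CO2e at a given grid intensity."""
    return energy_wh * intensity_g_per_kwh / 1000.0


ratio = energy_ratio(CNN_ENERGY_WH, LORA_ENERGY_WH)
saving_wh = LORA_ENERGY_WH - CNN_ENERGY_WH

print(f"CNN / LoRA energy ratio: {ratio:.2f}")
print(f"Energy saved: {saving_wh:.0f} Wh")
print(f"CNN CO2e:  {co2e_grams(CNN_ENERGY_WH, ASSUMED_INTENSITY_G_PER_KWH):.0f} g")
print(f"LoRA CO2e: {co2e_grams(LORA_ENERGY_WH, ASSUMED_INTENSITY_G_PER_KWH):.0f} g")
```

The ratio comes out near 0.42, so the CNN uses somewhat less than half the energy of federated LoRA. Because the emissions estimate scales linearly with the assumed intensity, the ratio between the two methods holds for any grid mix.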

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Human Pose and Action Recognition · Privacy-Preserving Technologies in Data