Embedding-based Retrieval in Multimodal Content Moderation
Hanzhong Liang, Jinghao Shi, Xiang Shen, Zixuan Wang, Vera Wen, Ardalan Mehrani, Zhiqian Chen, Yifan Wu, Zhixin Zhang

TL;DR
This paper introduces an Embedding-Based Retrieval system for video content moderation that improves efficiency, adaptability, and interpretability over traditional classification methods, demonstrated through extensive offline and online experiments.
Contribution
The paper develops an embedding-based retrieval approach built on supervised contrastive learning. Its embedding models outperform established contrastive methods such as CLIP and MoCo, and the retrieval system improves how video moderation responds to emerging trends.
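For readers unfamiliar with supervised contrastive learning, the following is a minimal sketch of the generic supervised contrastive (SupCon) loss in plain Python. It is not the authors' training code: the function name `supcon_loss`, the default temperature of 0.1, and the toy 2-D embeddings are all assumptions made for illustration. The idea is that samples sharing a label are pulled together and all other samples in the batch are pushed apart.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic SupCon loss over one batch (illustrative, hypothetical helper).

    embeddings: list of equal-length float lists (non-zero vectors).
    labels: list of class ids, one per embedding.
    """
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives: other samples in the batch with the same label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Scaled similarities between the anchor and every other sample.
        logits = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Average negative log-probability over the anchor's positives.
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters in 2-D.
batch = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
loss_matching = supcon_loss(batch, labels=[0, 0, 1, 1])    # labels follow clusters
loss_mismatched = supcon_loss(batch, labels=[0, 1, 0, 1])  # labels cut across clusters
```

When the labels agree with the geometric clusters the loss is small, and when they cut across clusters it is large, which is the signal that drives same-class embeddings together during training.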
Findings
ROC-AUC improved from 0.85 to 0.99
PR-AUC increased from 0.35 to 0.95
Operational costs reduced by over 80%
Abstract
Video understanding plays a fundamental role in content moderation on short video platforms, enabling the detection of inappropriate content. While classification remains the dominant approach for content moderation, it often struggles in scenarios requiring rapid and cost-efficient responses, such as trend adaptation and urgent escalations. To address this issue, we introduce an Embedding-Based Retrieval (EBR) method designed to complement traditional classification approaches. We first leverage a Supervised Contrastive Learning (SCL) framework to train a suite of foundation embedding models, including both single-modal and multi-modal architectures. Our models demonstrate superior performance over established contrastive learning methods such as CLIP and MoCo. Building on these embedding models, we design and implement the embedding-based retrieval system that integrates embedding…
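The general idea of embedding-based retrieval can be illustrated with a small nearest-neighbor lookup. This is only a sketch under stated assumptions, not the system described in the paper: the seed bank layout, the function name `retrieve`, the cosine-similarity metric, and the `top_k` and `threshold` values are all hypothetical. A new video's embedding is compared against embeddings of known reference examples, and sufficiently similar references are returned as evidence.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(query, seed_bank, top_k=3, threshold=0.8):
    """Return up to top_k seeds whose similarity to the query is >= threshold.

    seed_bank: list of (seed_id, label, embedding) tuples (hypothetical layout).
    Result: list of (seed_id, label, score), most similar first.
    """
    scored = [
        (cosine_similarity(query, emb), seed_id, label)
        for seed_id, label, emb in seed_bank
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [
        (seed_id, label, score)
        for score, seed_id, label in scored[:top_k]
        if score >= threshold
    ]


# Toy seed bank with made-up 3-D embeddings.
seed_bank = [
    ("seed-1", "violating", [1.0, 0.0, 0.0]),
    ("seed-2", "violating", [0.9, 0.1, 0.0]),
    ("seed-3", "benign", [0.0, 1.0, 0.0]),
]

matches = retrieve([1.0, 0.05, 0.0], seed_bank, top_k=2, threshold=0.9)
for seed_id, label, score in matches:
    print(f"{seed_id} ({label}): {score:.3f}")
```

In a lookup like this, responding to a new trend amounts to adding seed embeddings to the bank rather than retraining a classifier, and the matched seeds themselves show why a video was flagged. A brute-force scan is used here for clarity; at scale an approximate nearest-neighbor index would normally replace it.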
