IPS: In-Prompt Process Supervision for Short Video Content Moderation
Mingchao Liu, Yu Sun, Ruixiao Sun, Xin Dong, Xiang Shen, Hongwei Wang, Hongyu Xiong, Yang Song

TL;DR
The paper introduces IPS, a framework that improves multimodal large language models (MLLMs) for short video content moderation by fine-tuning them to reason sequentially over ancillary questions, which raises accuracy and scales with model-generated annotations.
Contribution
IPS is a novel in-prompt process supervision method that improves MLLM performance on content moderation tasks and remains robust to noisy ancillary labels.
Findings
IPS consistently outperforms baseline MLLMs on public and proprietary benchmarks.
Replacing human-annotated ancillary labels with MLLM-generated ones causes only marginal performance degradation.
IPS demonstrates robustness and scalability in industrial settings.
Abstract
Multimodal large language models (MLLMs) are effective at capturing the semantics of short video content; however, they often fail to attend to the policy-specific details required for reliable content moderation. To address this limitation, we introduce IPS, a novel framework that integrates In-prompt Process Supervision into MLLMs by introducing sequential reasoning over ancillary questions during fine-tuning. IPS consistently outperforms baseline MLLMs on public and proprietary benchmarks. Moreover, replacing human-annotated ancillary labels with MLLM-generated ones results in only marginal performance degradation, demonstrating robustness to noisy supervision and strong scalability with model-generated annotations. These findings establish IPS as a scalable and effective solution for complex multimodal classification in large-scale industrial settings.
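To make the idea of sequential reasoning over ancillary questions concrete, here is a minimal sketch of how a single fine-tuning example might be assembled. The abstract does not specify the prompt format, so everything here is an assumption for illustration: the `AncillaryQA` type, the `build_training_example` helper, the `Q1`/`A1`/`Decision` layout, and the sample questions are all hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AncillaryQA:
    question: str  # hypothetical policy-detail question
    answer: str    # label from a human annotator or from an MLLM


def build_training_example(video_ref: str,
                           ancillary: List[AncillaryQA],
                           final_question: str,
                           final_label: str) -> dict:
    """Build a prompt/target pair in which the ancillary answers
    precede the final moderation decision (assumed layout)."""
    prompt_lines = [f"Video: {video_ref}", "Answer the questions in order."]
    target_lines = []
    for i, qa in enumerate(ancillary, start=1):
        prompt_lines.append(f"Q{i}: {qa.question}")
        target_lines.append(f"A{i}: {qa.answer}")
    # The final policy question comes last, so the decision in the
    # target follows the intermediate answers.
    prompt_lines.append(f"Final: {final_question}")
    target_lines.append(f"Decision: {final_label}")
    return {"prompt": "\n".join(prompt_lines),
            "target": "\n".join(target_lines)}


example = build_training_example(
    video_ref="<video>",
    ancillary=[
        AncillaryQA("Is a weapon visible?", "no"),
        AncillaryQA("Is anyone injured?", "no"),
    ],
    final_question="Does the video violate the violence policy?",
    final_label="non-violating",
)
```

Under this assumed layout, swapping human-annotated ancillary answers for MLLM-generated ones only changes where each `answer` value comes from; the structure of the example stays the same.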
