IPS: In-Prompt Process Supervision for Short Video Content Moderation

Mingchao Liu; Yu Sun; Ruixiao Sun; Xin Dong; Xiang Shen; Hongwei Wang; Hongyu Xiong; Yang Song

arXiv:2412.15251·cs.CL·May 5, 2026

TL;DR

The paper introduces IPS, a framework that fine-tunes multimodal large language models (MLLMs) for short video content moderation to reason sequentially over ancillary questions, improving accuracy over baseline MLLMs and scaling to model-generated annotations.

Contribution

IPS is a novel in-prompt process supervision method that improves MLLM performance on content moderation tasks and remains robust to noisy ancillary labels.

Findings

01

IPS consistently outperforms baseline MLLMs on both public and proprietary benchmarks.

02

Replacing human-annotated ancillary labels with MLLM-generated ones causes only marginal performance degradation.

03

IPS is robust to noisy supervision and scales with model-generated annotations, making it suitable for large-scale industrial settings.

Abstract

Multimodal large language models (MLLMs) are effective at capturing the semantics of short video content; however, they often fail to attend to the policy-specific details required for reliable content moderation. To address this limitation, we introduce IPS, a novel framework that integrates In-prompt Process Supervision into MLLMs by introducing sequential reasoning over ancillary questions during fine-tuning. IPS consistently outperforms baseline MLLMs on public and proprietary benchmarks. Moreover, replacing human-annotated ancillary labels with MLLM-generated ones results in only marginal performance degradation, demonstrating robustness to noisy supervision and strong scalability with model-generated annotations. These findings establish IPS as a scalable and effective solution for complex multimodal classification in large-scale industrial settings.
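To make the idea of sequential reasoning over ancillary questions concrete, here is a minimal sketch of how one fine-tuning sample might be assembled. It is not the paper's actual prompt template, which this page does not give. The names `AncillaryQA` and `build_ips_example`, the question wording, and the `Q1`/`A1` layout are all hypothetical, and the video is represented by a text placeholder rather than real visual input.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AncillaryQA:
    question: str  # policy-specific sub-question (hypothetical wording)
    answer: str    # label from a human annotator or from an MLLM


def build_ips_example(video_desc: str, ancillary: List[AncillaryQA],
                      final_question: str, final_label: str) -> dict:
    """Assemble one training sample (assumed layout, not the paper's template).

    Ancillary questions come first and the final moderation question last,
    so the target text answers them in that order.
    """
    questions = [qa.question for qa in ancillary] + [final_question]
    answers = [qa.answer for qa in ancillary] + [final_label]
    prompt_lines = [f"Video: {video_desc}", "Answer the questions in order."]
    prompt_lines += [f"Q{i}: {q}" for i, q in enumerate(questions, start=1)]
    target_lines = [f"A{i}: {a}" for i, a in enumerate(answers, start=1)]
    return {"prompt": "\n".join(prompt_lines), "target": "\n".join(target_lines)}


example = build_ips_example(
    video_desc="<video placeholder>",
    ancillary=[
        AncillaryQA("Does the video show a weapon?", "yes"),
        AncillaryQA("Is the weapon used to threaten a person?", "no"),
    ],
    final_question="Does the video violate the violence policy?",
    final_label="no",
)
print(example["prompt"])
print(example["target"])
```

Under this assumed layout, swapping human-annotated ancillary answers for MLLM-generated ones only changes the `answer` fields, which is the substitution the abstract reports as causing marginal degradation.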
