Moment and Highlight Detection via MLLM Frame Segmentation

I Putu Andika Bagas Jiwanta; Ayu Purwarianti

arXiv:2512.12246·cs.CV·December 16, 2025

Open Access

TL;DR

This paper introduces a novel segmentation-based approach using Multimodal Large Language Models (MLLMs) to detect video moments and highlights directly from frame sequences, improving efficiency and performance.

Contribution

It proposes applying segmentation objectives directly on MLLM output tokens for moment and highlight detection, enabling direct frame-level predictions with fewer frames and stable training.
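To make "direct frame-level predictions" concrete, here is a minimal sketch of how a per-frame output could be turned into moment spans. It assumes, as the abstract describes, one "0" or "1" character per frame. The function name `labels_to_moments` and the uniform-sampling assumption (each frame covers an equal slice of the video) are illustrative choices, not details confirmed by the paper.

```python
def labels_to_moments(labels: str, duration: float) -> list[tuple[float, float]]:
    """Convert a per-frame '0'/'1' string into (start, end) spans in seconds.

    Assumption (not from the paper): frames are sampled uniformly, so frame i
    covers [i * step, (i + 1) * step) with step = duration / len(labels).
    """
    if not labels or set(labels) - {"0", "1"}:
        raise ValueError("labels must be a non-empty string of '0' and '1'")
    step = duration / len(labels)
    moments = []
    start = None  # index of the first frame of the current foreground run
    for i, ch in enumerate(labels):
        if ch == "1" and start is None:
            start = i
        elif ch == "0" and start is not None:
            moments.append((start * step, i * step))
            start = None
    if start is not None:  # a run that extends to the last frame
        moments.append((start * step, len(labels) * step))
    return moments
```

With 10 frames over a 100-second video, `labels_to_moments("0011100110", 100.0)` yields two spans, one per run of consecutive "1" characters.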

Findings

01. Achieved 56.74 HIT@1 on QVHighlights with only 25 frames.

02. Outperformed baseline with 35.28 MAP in moment retrieval.

03. Segmentation losses provide stable learning signals even when causal LM loss plateaus.

Abstract

Detecting video moments and highlights from natural-language queries has been unified by transformer-based methods. Other works use generative Multimodal LLMs (MLLMs) to predict moments and/or highlights as text timestamps, utilizing their reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM's output tokens. The LLM is fed a fixed number of frames alongside a prompt that constrains it to output a continuous sequence of "0" and/or "1" characters, with one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities,…
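The idea of reading "0"/"1" token scores as background and foreground probabilities can be sketched as follows. The abstract is truncated here and does not name the specific segmentation losses, so the binary cross-entropy and Dice losses below are common segmentation objectives chosen for illustration only. The helper names and the input format (one pair of logits per frame position, for the "0" and "1" tokens) are likewise assumptions.

```python
import math

def foreground_probs(logit_pairs):
    """Per-frame foreground probability from (logit_of_'0', logit_of_'1') pairs.

    A two-way softmax over the two logits equals a sigmoid of their difference.
    """
    return [1.0 / (1.0 + math.exp(z0 - z1)) for z0, z1 in logit_pairs]

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy (illustrative choice, not confirmed by the paper)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss (illustrative choice, not confirmed by the paper)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)

# Hypothetical logits for 4 frames; frames 2 and 3 are the target moment.
logits = [(3.0, -3.0), (-2.0, 2.0), (-4.0, 4.0), (1.0, -1.0)]
targets = [0, 1, 1, 0]
probs = foreground_probs(logits)
loss = bce_loss(probs, targets) + dice_loss(probs, targets)
```

Because both losses are functions of the token logits at each frame position, they yield a gradient for every frame, which is the property the abstract contrasts with text-timestamp generation.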

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Video Analysis and Summarization