Moment and Highlight Detection via MLLM Frame Segmentation
I Putu Andika Bagas Jiwanta, Ayu Purwarianti

TL;DR
This paper introduces a novel segmentation-based approach using Multimodal Large Language Models (MLLMs) to detect video moments and highlights directly from frame sequences, improving efficiency and performance.
Contribution
It proposes applying segmentation objectives directly on MLLM output tokens for moment and highlight detection, enabling direct frame-level predictions with fewer frames and stable training.
Findings
Achieved 56.74 HIT@1 on QVHighlights with only 25 frames.
Outperformed the baseline in moment retrieval with 35.28 mAP.
Segmentation losses provide stable learning signals even when causal LM loss plateaus.
Abstract
Detecting video moments and highlights from natural-language queries has been unified by transformer-based methods. Other works use generative Multimodal LLMs (MLLMs) to predict moments and/or highlights as text timestamps, utilizing their reasoning capability. While effective, text-based generation cannot provide direct gradients for frame-level predictions because the model only emits language tokens. Although recent Reinforcement Learning (RL) methods attempt to address the issue, we propose a novel approach by applying segmentation objectives directly on the LLM's output tokens. The LLM is fed with a fixed number of frames alongside a prompt that forces it to output a sequence of continuous "0" and/or "1" characters, with one character per frame. The "0"/"1" characters benefit from the LLM's inherent language capability while also acting as background and foreground probabilities,…
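The idea of reading the "0"/"1" output tokens as per-frame background and foreground probabilities can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the segmentation objective, so binary cross-entropy is used here as an assumed stand-in, and the function names, the 0.5 threshold, and the uniform frame spacing are all hypothetical choices made for illustration.

```python
import math

def foreground_probs(logits_zero, logits_one):
    """Per-frame foreground probability from the logits of the "0" and "1" tokens.

    Assumption: at each output position we keep only the two logits for the
    "0" and "1" tokens and apply a two-way softmax.
    """
    probs = []
    for z0, z1 in zip(logits_zero, logits_one):
        m = max(z0, z1)  # subtract the max for numerical stability
        e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
        probs.append(e1 / (e0 + e1))
    return probs

def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over frames (an assumed segmentation loss)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def decode_moments(probs, video_duration, threshold=0.5):
    """Turn runs of foreground frames into (start_sec, end_sec) moments.

    Assumption: frames are sampled uniformly, so each frame covers
    video_duration / len(probs) seconds.
    """
    n = len(probs)
    sec_per_frame = video_duration / n
    moments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a foreground run begins
        elif p < threshold and start is not None:
            moments.append((start * sec_per_frame, i * sec_per_frame))
            start = None  # the run ended at the previous frame
    if start is not None:  # run extends to the end of the video
        moments.append((start * sec_per_frame, n * sec_per_frame))
    return moments

# Toy example: 5 frames sampled from a 10-second video.
logits_zero = [2.0, 0.0, -2.0, -2.0, 2.0]
logits_one = [-2.0, 0.0, 2.0, 2.0, -2.0]
probs = foreground_probs(logits_zero, logits_one)
labels = [0, 1, 1, 1, 0]
loss = bce_loss(probs, labels)
moments = decode_moments(probs, video_duration=10.0)  # [(2.0, 8.0)]
```

Because the loss is computed on probabilities rather than on decoded text, every frame position contributes a gradient, which is the property the abstract contrasts with timestamp generation. The same per-frame probabilities could also serve as highlight scores, though how the paper derives its highlight scores is not stated in this excerpt.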
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Video Analysis and Summarization
