LUST: A Multi-Modal Framework with Hierarchical LLM-based Scoring for Learned Thematic Significance Tracking in Multimedia Content

Anderson de Lima Luiz

arXiv:2508.04353·cs.MM·August 7, 2025

TL;DR

LUST is a multi-modal framework that uses hierarchical LLM-based scoring to track thematic significance in videos, combining visual, audio, and contextual analysis for detailed content understanding.

Contribution

The paper introduces LUST, a novel hierarchical relevance scoring system that integrates multi-modal data and LLMs to analyze and quantify thematic significance in multimedia content.

Findings

01 Effective multi-modal relevance scoring demonstrated

02 Hierarchical scoring improves thematic tracking accuracy

03 Provides detailed annotations and analytical logs

Abstract

This paper introduces the Learned User Significance Tracker (LUST), a framework designed to analyze video content and quantify the thematic relevance of its segments in relation to a user-provided textual description of significance. LUST leverages a multi-modal analytical pipeline, integrating visual cues from video frames with textual information extracted via Automatic Speech Recognition (ASR) from the audio track. The core innovation lies in a hierarchical, two-stage relevance scoring mechanism employing Large Language Models (LLMs). An initial "direct relevance" score, $S_{d,i}$, assesses individual segments based on immediate visual and auditory content against the theme. This is followed by a "contextual relevance" score, $S_{c,i}$, that refines the assessment by incorporating the temporal progression of preceding thematic scores, allowing the model to understand evolving…
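
The two-stage scoring described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Segment` fields, the scorer signatures, the `window` parameter, and the choice to feed preceding direct scores into the contextual stage are all assumptions made for illustration, and the two toy scorers are simple stand-ins for what would be LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Segment:
    index: int
    visual_summary: str  # hypothetical: text description of sampled frames
    transcript: str      # ASR text for the segment


# Hypothetical scorer signatures; in a real system each would wrap an LLM call.
DirectScorer = Callable[[Segment, str], float]
ContextScorer = Callable[[Segment, str, float, Sequence[float]], float]


def score_segments(
    segments: Sequence[Segment],
    theme: str,
    direct: DirectScorer,
    contextual: ContextScorer,
    window: int = 3,
) -> List[Tuple[float, float]]:
    """Return (S_d, S_c) per segment, scoring segments in temporal order."""
    results: List[Tuple[float, float]] = []
    history: List[float] = []  # assumption: preceding direct scores
    for seg in segments:
        s_d = direct(seg, theme)
        # Only the most recent `window` preceding scores are passed on.
        recent = history[max(0, len(history) - window):]
        s_c = contextual(seg, theme, s_d, recent)
        results.append((s_d, s_c))
        history.append(s_d)
    return results


def keyword_direct(seg: Segment, theme: str) -> float:
    """Toy stand-in for the LLM: fraction of theme words found in the segment."""
    theme_words = set(theme.lower().split())
    text_words = set((seg.visual_summary + " " + seg.transcript).lower().split())
    if not theme_words:
        return 0.0
    return len(theme_words & text_words) / len(theme_words)


def smoothed_contextual(
    seg: Segment, theme: str, s_d: float, history: Sequence[float]
) -> float:
    """Toy stand-in: blend the direct score with the mean of preceding scores."""
    if not history:
        return s_d
    return 0.5 * s_d + 0.5 * sum(history) / len(history)


segments = [
    Segment(0, "a dog runs", "look at the dog"),
    Segment(1, "dog in park", ""),
    Segment(2, "city street", "traffic noise"),
]
scores = score_segments(segments, "dog park", keyword_direct, smoothed_contextual)
for seg, (s_d, s_c) in zip(segments, scores):
    print(f"segment {seg.index}: S_d={s_d:.3f}, S_c={s_c:.3f}")
```

Passing the scorers as callables keeps the control flow (direct score first, then a context-aware refinement over a sliding history) separate from whichever model produces the numbers, so the stand-ins could be swapped for LLM-backed functions without changing the loop.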