LUST: A Multi-Modal Framework with Hierarchical LLM-based Scoring for Learned Thematic Significance Tracking in Multimedia Content
Anderson de Lima Luiz

TL;DR
LUST is a multi-modal framework that uses hierarchical LLM-based scoring to track thematic significance in videos, combining visual, audio, and contextual analysis for detailed content understanding.
Contribution
The paper introduces LUST, a novel hierarchical relevance scoring system that integrates multi-modal data and LLMs to analyze and quantify thematic significance in multimedia content.
Findings
Effective multi-modal relevance scoring demonstrated
Hierarchical scoring improves thematic tracking accuracy
Provides detailed annotations and analytical logs
Abstract
This paper introduces the Learned User Significance Tracker (LUST), a framework designed to analyze video content and quantify the thematic relevance of its segments in relation to a user-provided textual description of significance. LUST leverages a multi-modal analytical pipeline, integrating visual cues from video frames with textual information extracted via Automatic Speech Recognition (ASR) from the audio track. The core innovation lies in a hierarchical, two-stage relevance scoring mechanism employing Large Language Models (LLMs). An initial "direct relevance" score assesses individual segments based on immediate visual and auditory content against the theme. This is followed by a "contextual relevance" score that refines the assessment by incorporating the temporal progression of preceding thematic scores, allowing the model to understand evolving…
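The two-stage scoring described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Segment` type, the function names, the `history_len` window, and the choice to feed preceding contextual scores (rather than direct scores) into the second stage are all hypothetical. The two toy scorers stand in for the LLM calls, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Segment:
    """One video segment (hypothetical structure)."""
    frame_description: str  # text summary of the visual content
    transcript: str         # ASR text for the segment's audio


# Hypothetical scorer signatures; in practice each would wrap an LLM call.
DirectScorer = Callable[[Segment, str], float]
ContextScorer = Callable[[float, Sequence[float], str], float]


def score_segments(
    segments: Sequence[Segment],
    theme: str,
    direct_scorer: DirectScorer,
    context_scorer: ContextScorer,
    history_len: int = 3,  # assumed window size, not from the paper
) -> Tuple[List[float], List[float]]:
    """Return (direct, contextual) score lists, one entry per segment."""
    direct: List[float] = []
    contextual: List[float] = []
    for seg in segments:
        # Stage 1: score the segment on its own content against the theme.
        d = direct_scorer(seg, theme)
        # Stage 2: refine using the most recent preceding scores.
        history = contextual[-history_len:]
        c = context_scorer(d, history, theme)
        direct.append(d)
        contextual.append(c)
    return direct, contextual


def keyword_direct(seg: Segment, theme: str) -> float:
    """Toy stand-in for the LLM: fraction of theme words found in the segment."""
    words = set(theme.lower().split())
    text = (seg.frame_description + " " + seg.transcript).lower().split()
    hits = sum(1 for w in words if w in text)
    return hits / len(words) if words else 0.0


def blend_context(d: float, history: Sequence[float], theme: str) -> float:
    """Toy stand-in for the LLM: average the direct score with recent history."""
    if not history:
        return d
    return 0.5 * d + 0.5 * (sum(history) / len(history))
```

Passing the scorers in as callables keeps the pipeline shape separate from the model choice, so the toy functions could be swapped for real LLM-backed scorers without changing `score_segments`.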
