From Content to Audience: A Multimodal Annotation Framework for Broadcast Television Analytics
Paolo Cupini, Francesco Pierri

TL;DR
This paper systematically evaluates multimodal annotation pipelines for Italian broadcast TV news, compares model performance across pipeline configurations, and deploys a framework for content-based audience analytics.
Contribution
The paper introduces a domain-specific benchmark and compares nine models across two pipeline architectures. It finds that performance gains are model-dependent and describes an operational deployment for audience analysis.
Findings
Larger models leverage temporal continuity effectively.
Smaller models degrade with extended multimodal input due to token overload.
The deployed pipeline enables correlational analysis of audience engagement.
Abstract
Automated semantic annotation of broadcast television content presents distinctive challenges, combining structured audiovisual composition, domain-specific editorial patterns, and strict operational constraints. While multimodal large language models (MLLMs) have demonstrated strong general-purpose video understanding capabilities, their comparative effectiveness across pipeline architectures and input configurations in broadcast-specific settings remains empirically undercharacterized. This paper presents a systematic evaluation of multimodal annotation pipelines applied to broadcast television news in the Italian setting. We construct a domain-specific benchmark of clips labeled across four semantic dimensions: visual environment classification, topic classification, sensitive content detection, and named entity recognition. Two different pipeline architectures are evaluated across…
