Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews)

Shih-Han Chou; Matthew Kowal; Yasmin Niknam; Diana Moyano; Shayaan Mehdi; Richard Pito; Cheng Zhang; Ian Knopke; Sedef Akinli Kocak; Leonid Sigal; Yalda Mohsenzadeh

arXiv:2401.12419 · cs.CV · January 24, 2024 · 1 citation

Open Access

TL;DR

This paper introduces ReutersViLNews, a large-scale, professionally labeled dataset of news videos for high-level video-language understanding, and benchmarks current algorithms, highlighting their limitations in understanding complex news content.

Contribution

The paper presents ReutersViLNews, a new dataset with detailed annotations for long-form news videos, and evaluates existing models, revealing challenges in high-level news understanding.

Findings

01. Current models struggle with high-level news comprehension.

02. ReutersViLNews contains diverse, professionally labeled news videos.

03. Benchmark results show room for improvement in news video understanding.

Abstract

While progress has been made in the domain of video-language understanding, current state-of-the-art algorithms are still limited in their ability to understand videos at high levels of abstraction, such as news-oriented videos. In contrast, humans easily amalgamate information from video and language to infer information beyond what is visually observable in the pixels. An example of this is watching a news story, where the context of the event can play as big a role in understanding the story as the event itself. Towards designing this ability in algorithms, we present a large-scale analysis of an in-house dataset collected by the Reuters News Agency, called the Reuters Video-Language News (ReutersViLNews) dataset, which focuses on high-level video-language understanding with an emphasis on long-form news. The ReutersViLNews Dataset consists of long-form news videos…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Subtitles and Audiovisual Media