Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization

Richard Luo, Austin Peng, Adithya Vasudev, and Rishabh Jain

arXiv:2405.20648 · cs.CV · October 29, 2024

TL;DR

Shotluck Holmes introduces a family of efficient small-scale large language vision models that significantly improve video captioning and summarization by understanding shot-by-shot semantic information with less computational cost.

Contribution

The paper presents a family of small, efficient large language vision models (LLVMs) that extend visual understanding from images to videos, improving captioning and summarization.

Findings

01. Outperforms the state of the art on the Shot2Story task

02. Uses fewer computational resources than larger models

03. Achieves better accuracy in video understanding

Abstract

Video is an increasingly prominent and information-dense medium, yet it poses substantial challenges for language models. A typical video consists of a sequence of shorter segments, or shots, that collectively form a coherent narrative. Each shot is analogous to a word in a sentence, except that multiple streams of information (such as visual and auditory data) must be processed simultaneously. Comprehending the entire video requires not only understanding the visual-audio information of each shot but also linking the ideas across shots to generate a larger, all-encompassing story. Despite significant progress in the field, current works often overlook videos' more granular shot-by-shot semantic information. In this project, we propose a family of efficient large language vision models (LLVMs) to boost video summarization and captioning called Shotluck…

Code & Models

Repositories

Skyline-9/Shotluck-Holmes
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization