HiLight: Technical Report on the Motern AI Video Language Model

Zhiting Wang; Qiangong Zhou; Kangjie Yang; Zongyang Liu; Xin Mao

arXiv:2407.07325 · cs.CV · July 12, 2024

Open Access

TL;DR

This paper introduces HiLight, a video-language model with dual visual towers for video-text alignment and user interaction, focusing on video comprehension in billiards.

Contribution

It presents a novel video encoder and conversation framework for improved video-text alignment and user interaction in video comprehension tasks.

Findings

01 Effective video-text alignment achieved

02 Enhanced video conversation capabilities demonstrated

03 Application to billiards video understanding

Abstract

This technical report presents the implementation of a state-of-the-art video encoder for video-text modal alignment and a video conversation framework called HiLight, which features dual visual towers. The work is divided into two main parts: (1) alignment of the video and text modalities; (2) a convenient and efficient way to interact with users. Our goal is to address the task of video comprehension in the context of billiards. The report includes a discussion of the concepts and the final solution developed during the task's implementation.

Taxonomy

Topics: Multimodal Machine Learning Applications · Machine Learning and Data Classification · Robotics and Automated Systems