HiLight: Technical Report on the Motern AI Video Language Model
Zhiting Wang, Qiangong Zhou, Kangjie Yang, Zongyang Liu, Xin Mao

TL;DR
This paper introduces HiLight, a video-language model with dual visual towers for video-text alignment and user interaction, focusing on video comprehension in billiards.
Contribution
It presents a novel video encoder and conversation framework for improved video-text alignment and user interaction in video comprehension tasks.
Findings
Effective video-text alignment achieved
Enhanced video conversation capabilities demonstrated
Application to billiards video understanding
Abstract
This technical report presents the implementation of a state-of-the-art video encoder for video-text modality alignment and a video conversation framework called HiLight, which features dual visual towers. The work is divided into two main parts: (1) alignment of the video and text modalities; (2) a convenient and efficient way to interact with users. Our goal is to address the task of video comprehension in the context of billiards. The report also discusses the concepts explored and the final solution developed while implementing the task.
Taxonomy
Topics: Multimodal Machine Learning Applications · Machine Learning and Data Classification · Robotics and Automated Systems
