Multimodal Language Models for Domain-Specific Procedural Video Summarization
Nafisa Hussain

TL;DR
This paper investigates fine-tuning multimodal large language models, specifically TimeChat, on domain-specific datasets to improve summarization and step-by-step instruction generation for long instructional videos in the cooking and medical domains.
Contribution
The paper demonstrates that domain-specific fine-tuning of TimeChat improves video summarization and instruction extraction in specialized fields.
Findings
Fine-tuning improves key step extraction accuracy.
Domain-specific datasets enhance summarization quality.
Models provide personalized, domain-relevant guidance.
Abstract
Videos serve as a powerful medium to convey ideas, tell stories, and provide detailed instructions, especially through long-format tutorials. Such tutorials are valuable for learning new skills at one's own pace, yet they can be overwhelming due to their length and dense content. Viewers often seek specific information, like precise measurements or step-by-step execution details, making it essential to extract and summarize key segments efficiently. An intelligent, time-sensitive video assistant capable of summarizing and detecting highlights in long videos is highly sought after. Recent advancements in Multimodal Large Language Models offer promising solutions to develop such an assistant. Our research explores the use of multimodal models to enhance video summarization and step-by-step instruction generation within specific domains. These models need to understand temporal events and…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
