Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks

Daniel Wen; Nafisa Hussain

arXiv:2406.16346·cs.CV·June 25, 2024

Open Access

TL;DR

This paper introduces a method called Directed Domain Fine-Tuning that tailors multimodal models to specific tasks by using domain-specific instructional datasets and LoRA, improving task precision with less data.

Contribution

It proposes a novel fine-tuning approach that filters out noise irrelevant to the target domain, improving model performance on specific tasks with less training data.

Findings

01

Achieved a 2% improvement on the YouCook2 dataset.

02

Used significantly less training data than the baseline.

03

Enhanced model focus on task-specific features.

Abstract

Large language models (LLMs) and large visual language models (LVLMs) have been at the forefront of the artificial intelligence field, particularly for tasks such as text generation, video captioning, and question answering. Typically, these models are trained on broader knowledge bases or datasets to increase generalizability, learn relationships between topics, and recognize patterns. Instead, we propose to provide instructional datasets specific to the task of each modality within a distinct domain and then fine-tune the parameters of the model using LoRA. With our approach, we can eliminate all noise irrelevant to the given task while also ensuring that the model generates with enhanced precision. For this work, we use Video-LLaVA to generate recipes given cooking videos without transcripts. Video-LLaVA's multimodal architecture allows us to provide cooking images to…
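
For readers unfamiliar with LoRA, the following is a minimal sketch of the general low-rank adaptation idea the abstract refers to, not the authors' implementation. A pretrained weight matrix W stays frozen, and only two small matrices A and B are trained, so the effective weight is W + (alpha / r) · B · A. The class name `LoRALinear`, the rank `r`, the scale `alpha`, and the layer sizes are all hypothetical; the page does not state which layers or hyperparameters the paper used.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Illustrative only: sizes, rank, and alpha are assumptions.
    """

    def __init__(self, d_in, d_out, r=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        # Stand-in for a pretrained weight, shape d_out x d_in. It is never updated.
        self.W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
        # A starts small and random, B starts at zero, so the adapter is a no-op at first.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def effective_weight(self):
        # B @ A has shape d_out x d_in and rank at most r.
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.W, delta)
        ]

    def forward(self, x):
        # x is a single input vector of length d_in.
        column = [[v] for v in x]
        return [row[0] for row in matmul(self.effective_weight(), column)]

    def trainable_params(self):
        # Only A and B would receive gradient updates during fine-tuning.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_params(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=8, d_out=4, r=2)
print(layer.trainable_params(), layer.frozen_params())
```

In this toy layer the adapter has fewer parameters than the frozen weight, and the gap widens as the layer grows relative to the rank, which is why LoRA suits fine-tuning on a small domain-specific dataset.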

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling