Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks
Daniel Wen, Nafisa Hussain

TL;DR
This paper introduces a method called Directed Domain Fine-Tuning that tailors multimodal models to specific tasks by using domain-specific instructional datasets and LoRA, improving task precision with less data.
Contribution
It proposes a novel fine-tuning approach that filters out noise irrelevant to the target domain, improving model performance on specific tasks with less training data.
Findings
Achieved a 2% improvement on the YouCook2 dataset.
Used significantly less training data than the baseline.
Enhanced model focus on task-specific features.
Abstract
Large language models (LLMs) and large visual language models (LVLMs) have been at the forefront of the artificial intelligence field, particularly for tasks such as text generation, video captioning, and question answering. Typically, these models are trained on broad knowledge bases or datasets to increase generalizability, learn relationships between topics, and recognize patterns. Instead, we propose providing instructional datasets specific to the task of each modality within a distinct domain and then fine-tuning the parameters of the model using LoRA. With our approach, we can eliminate all noise irrelevant to the given task while also ensuring that the model generates with enhanced precision. For this work, we use Video-LLaVA to generate recipes given cooking videos without transcripts. Video-LLaVA's multimodal architecture allows us to provide cooking images to…
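For readers unfamiliar with LoRA, the following is a minimal pure-Python sketch of the general idea: the pretrained weight stays frozen, and only a small low-rank update is trained. It is not the authors' implementation and does not involve Video-LLaVA. The class name LoRALinear, the helper matmul, and the rank and alpha values are all hypothetical choices made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Illustrative layer: frozen weight W plus a low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen pretrained weight, never updated
        self.scale = alpha / rank     # scaling applied to the low-rank update
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A) as a list of rows."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to a single input vector x."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameter_count(self):
        """Count only the entries of A and B; W is frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Hypothetical 4x4 frozen weight (identity) adapted with a rank-1 update.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1, alpha=2.0)
```

In this toy setup the adapter trains 8 values instead of the 16 in the full weight matrix, and because B starts at zero the layer initially behaves exactly like the frozen one. In practice, adapters of this kind are usually attached through a library rather than written by hand.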
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
